The Four Pillars of Data Engineering: Specialization for Modern Data Teams

Our Four Pillars of Data Engineering

Specialization is a naturally occurring process as industries progress and mature. Software Engineering developed as a field in the 1960s, but Data Engineering as a distinct discipline didn’t appear until the 21st century, even though SQL and relational databases have been around since the 1970s.

Now, modern and well-funded data teams often employ unique combinations of data engineers, analytics engineers, BI engineers, data analysts, business analysts, quantitative analysts, data scientists, ML engineers, AI engineers, data architects, solutions architects, analytics managers, data product managers, and so on. Each of these roles requires a different degree of specialization, often depending on the context of the team, organization, or industry. The nuances of these roles can be confusing for less mature data teams, especially in talent markets that haven’t developed that level of specialization.

We at Semantiks have simplified this landscape into four key verticals:

  • Data Engineering,
  • Analytics Engineering,
  • Machine Learning (ML) Engineering, and
  • Artificial Intelligence (AI) Engineering 

These are our Four Pillars of Data Engineering, and we will use this post to discuss the services we offer within each vertical and the tools we use to perform these tasks.

Data Engineering

Functions

Simply put, data engineering is the movement of data and the deployment and maintenance of the architecture required to support it. Our data engineering services often take the shape of developing data pipelines and deploying the tools and services on which those pipelines run. These projects typically involve migrating data from operational databases such as Postgres or MySQL, NoSQL databases like MongoDB, or third-party software vendors such as HubSpot or Zendesk into a data warehouse such as Google BigQuery. These processes are known as ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform), depending on the order of operations.
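To make the ELT pattern concrete, here is a minimal Python sketch of the extract and load steps, moving rows from a Postgres table into BigQuery. The connection string, table, and dataset names are hypothetical placeholders, not a production recipe.

    # Minimal ELT sketch: extract rows from Postgres, load them into BigQuery.
    # Connection string, table, and dataset names are hypothetical.
    import pandas as pd
    import sqlalchemy
    from google.cloud import bigquery

    engine = sqlalchemy.create_engine("postgresql://user:password@host:5432/appdb")

    # Extract: pull the raw operational table.
    df = pd.read_sql("SELECT * FROM orders", engine)

    # Load: write the untransformed rows to the warehouse. In ELT, the
    # transform step happens afterward, inside BigQuery (e.g. via dbt).
    client = bigquery.Client()
    client.load_table_from_dataframe(df, "analytics_raw.orders").result()

In a true ETL flow, the transformation would instead happen within the pipeline itself, before the load step.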

Managing access to data products and integrations is another aspect of data engineering. This can take the form of an integration with a BI tool or delivering Data-as-a-Service (DaaS) products to customers. Semantiks specializes in the design and implementation of these customer-facing DaaS products to help our clients unlock new revenue streams.

Tools

Data Movement

Fivetran: Fivetran is a managed service offering flexible, customizable data pipelines from hundreds of sources to your data warehouse, without significant infrastructure investment. It is an ideal solution for early-stage startups that do not have the resources to maintain and grow an engineering platform. Fivetran offers a 14-day free trial on each new connection, making it a low-risk option, and also offers a free tier for low-volume users.

Airbyte: Airbyte is a competitor to Fivetran that offers a fairly similar service model. Its key advantages over Fivetran are an open-source option, which allows free deployment on self-hosted hardware, and a wider variety of data sink destinations. We find Fivetran to be slightly more reliable than Airbyte, but we have seen many happy Airbyte customers and have worked successfully with both.

GCP Datastream: Datastream is a managed Change Data Capture (CDC) tool offered by Google for copying data from operational databases into the BigQuery data warehouse in near real time. It is a great option for existing Google Cloud customers who do not want to pay for outside services like Fivetran or Airbyte but would still benefit from a serverless, low-maintenance deployment option. We have found that the database administration and cloud networking requirements of Datastream can be more complex than those of Fivetran or Airbyte, but we have plenty of experience managing them.

Azure Data Factory: Data Factory provides a graphical interface for designing data pipelines in Microsoft Azure. These pipelines can include automated CDC as well as bulk copies from a wider variety of sources than Datastream offers (although GCP’s Dataflow and Data Fusion are better analogs to Data Factory). In our experience, developing, maintaining, and debugging Data Factory pipelines requires significant experience and can be a challenge for early-stage startups.

BigQuery Data Transfer Service: In a pinch, the BigQuery Data Transfer Service can be an effective tool for multi-cloud customers who have operational data in AWS or Azure. Database snapshots can be copied into BigQuery from cloud storage buckets. This is a last-resort option for early-stage clients who prefer not to use third-party managed services but do not have the technical expertise to maintain a CDC tool like Datastream.
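As an illustration, scheduling a recurring snapshot load from a Cloud Storage bucket with the Data Transfer Service Python client looks roughly like the sketch below; the project, bucket, dataset, and table names are all hypothetical.

    # Sketch: schedule a nightly load of snapshot files from a GCS bucket
    # into BigQuery via the Data Transfer Service. All names are hypothetical.
    from google.cloud import bigquery_datatransfer

    client = bigquery_datatransfer.DataTransferServiceClient()

    transfer_config = bigquery_datatransfer.TransferConfig(
        destination_dataset_id="analytics_raw",
        display_name="nightly-orders-snapshot",
        data_source_id="google_cloud_storage",
        params={
            "data_path_template": "gs://example-snapshots/orders/*.csv",
            "destination_table_name_template": "orders",
            "file_format": "CSV",
        },
        schedule="every 24 hours",
    )

    client.create_transfer_config(
        parent="projects/example-project/locations/us",
        transfer_config=transfer_config,
    )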

Cloud Run Functions: We have used Cloud Run functions for both data ingress and data delivery use cases. They are an excellent option for any workflow that can be deployed in a Function-as-a-Service framework, such as retrieving and returning data. This makes them flexible enough to support ETL use cases as well as serving data to customers in a DaaS model.
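For example, a minimal HTTP function serving warehouse data as a simple DaaS endpoint might look like this sketch; the dataset and table names are hypothetical.

    # Sketch of a Cloud Run function serving BigQuery data over HTTP,
    # written against the Functions Framework. Table names are hypothetical.
    import functions_framework
    from google.cloud import bigquery

    client = bigquery.Client()

    @functions_framework.http
    def get_daily_metrics(request):
        """Return the last 30 days of metrics as JSON (request is a Flask Request)."""
        rows = client.query(
            "SELECT metric_date, revenue FROM analytics.daily_metrics "
            "ORDER BY metric_date DESC LIMIT 30"
        ).result()
        return {"metrics": [dict(row) for row in rows]}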

Looker Embedded: Looker Embedded is a framework and library for serving Looker content directly to end users through your website or application. It is a simple and effective way to monetize data by delivering insights directly to customers. We provide demonstrations and code samples for embedding Looker, including key considerations like authentication, data privacy, and generating value for consumers.
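As a taste of what that involves, here is a minimal sketch of generating a signed embed URL through Looker’s Python SDK; the dashboard path, model name, and user details are hypothetical, and a real implementation would handle permissions and user attributes much more carefully.

    # Sketch: create a signed embed URL for a dashboard via the Looker SDK.
    # The SDK reads API credentials from a looker.ini file or environment
    # variables; the dashboard, model, and user values here are hypothetical.
    import looker_sdk
    from looker_sdk import models40 as models

    sdk = looker_sdk.init40()

    response = sdk.create_sso_embed_url(
        body=models.EmbedSsoParams(
            target_url="https://example.looker.com/embed/dashboards/42",
            external_user_id="customer-123",
            session_length=3600,
            permissions=["access_data", "see_user_dashboards"],
            models=["ecommerce"],
        )
    )

    # Embed response.url in an iframe in your application.
    print(response.url)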

Orchestration

Airflow: Apache Airflow is our preferred tool for automating and managing the coordination between various source systems (i.e. orchestration). We have developed patterns for easily deploying self-hosted Airflow instances on cloud hardware, and we have significant experience developing pipelines as Airflow DAGs or migrating them there. In our opinion, the self-hosted Airflow option is the best fit for SMBs that have the resources to maintain cloud infrastructure but lack the budget for a more expensive managed implementation such as Google Cloud Composer. Larger, better-funded data teams have the flexibility to choose between the self-hosted and managed routes.
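A DAG is simply Python code describing tasks and their dependencies. The sketch below, which assumes Airflow 2.x and hypothetical task logic, shows the shape of a daily extract-and-load pipeline.

    # Sketch of a daily ELT DAG in Airflow 2.x. The task bodies are
    # hypothetical placeholders for real extract and load logic.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract_orders():
        ...  # e.g. pull new rows from Postgres and land them in cloud storage

    def load_to_bigquery():
        ...  # e.g. load the landed files into a raw BigQuery dataset

    with DAG(
        dag_id="orders_elt",
        start_date=datetime(2025, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        extract = PythonOperator(task_id="extract_orders", python_callable=extract_orders)
        load = PythonOperator(task_id="load_to_bigquery", python_callable=load_to_bigquery)

        extract >> load  # the load task runs only after a successful extract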

Analytics Engineering

Functions

Analytics engineering as a discipline focuses on data modeling and development in the warehouse. Analytics engineers are responsible for everything that happens after data has been moved. This can take the form of data transformation, business intelligence development, or designing data marts and other data products. It is all the work that needs to happen between the data engineering pipeline and the user-facing dashboard. Building flexible, scalable data models is both an art and a science, and it is a skill that we at Semantiks pride ourselves on.

Tools

Business Intelligence

Looker: With over seven years of Looker experience, we consider ourselves experts at developing, scaling, and monetizing data products in Looker. Our clients have a wide range of Looker experience, and our projects have ranged from initial Looker instance setup to feature enhancements to optimization. We excel at every phase of Looker usage, from backend development, to project modeling, to business intelligence and dashboarding.

Data Transformation

dbt: Data Build Tool (dbt) is one of the first and most popular tools for analytics engineering workloads. It allows developers to augment basic SQL with human-readable templating to deploy queries as pipelines. We like dbt because SQL is a simple, well-known language for data transformation with low technical barriers to entry. The human resources needed to develop pipelines on dbt are much cheaper than those needed to build complex pipelines in Spark or Python, for example, making it a great fit for early-stage startups. dbt is open source and can be deployed for free on self-hosted hardware, or as a managed service on dbt Cloud.
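dbt models themselves are just templated SQL SELECT statements, but runs can also be driven programmatically, which is handy when wiring dbt into an orchestrator. The sketch below assumes dbt-core 1.5 or later and a hypothetical model selector.

    # Sketch: trigger a dbt run from Python (dbt-core >= 1.5), e.g. inside
    # an Airflow task. The model selector is hypothetical.
    from dbt.cli.main import dbtRunner

    result = dbtRunner().invoke(["run", "--select", "staging.stg_orders"])

    if not result.success:
        # Surface the failure to the orchestrator so the run can be retried.
        raise RuntimeError(f"dbt run failed: {result.exception}")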

Fivetran Transformations: Transformations are another Fivetran feature that can be used to automate analytics engineering workloads on data that Fivetran moves. There are pre-built Transformations for a variety of common software vendors, but Fivetran can also integrate with a dbt project. This allows us to automatically trigger SQL transformations after Fivetran copies our data from the source. It is a really nice synergy between Fivetran and dbt, and a great fit for early data teams that do not have the engineering resources to maintain pipeline infrastructure.

Looker Derived Tables: Derived Tables are a key component of Looker’s value proposition, allowing developers to deploy data transformations on read (i.e. when data is consumed by end users) or directly to the warehouse via Persistent Derived Tables. Like dbt, Derived Tables are another managed option for deploying transformations through SQL and human-readable configuration. They do not scale as well as dbt projects, especially with complex table dependencies, but we often incorporate Derived Tables as part of an initial Looker setup to minimize time to value.

ML Engineering

Functions

Machine Learning Operations, or MLOps, encompasses everything needed to support machine learning outside of training and testing models. ML engineers are responsible for feature engineering, model deployment, model evaluation, retraining and automation, and prediction generation. Some organizations also rely on ML engineers to optimize model efficiency and performance. Our ML engineering projects typically involve model hosting and deployment for online prediction, feature engineering solutions, and batch prediction pipelines.

Tools

Model Training

BigQuery ML: BQML is a framework for writing and deploying machine learning algorithms through SQL code. These models live in the BigQuery console, although execution often occurs through Google’s Vertex AI. We like BQML as a solution for training models because the models live in the same system as the data used to train them, meaning there are no complicated data integration requirements. In addition, BQML provides a standard, user-friendly syntax, in contrast to the many different ML libraries available in Python and other programming languages.
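To illustrate that syntax, here is a sketch of training and querying a BQML classifier from the Python client; the dataset, table, and column names are hypothetical.

    # Sketch: train a logistic regression model with BigQuery ML, then score
    # rows with ML.PREDICT. Dataset and column names are hypothetical.
    from google.cloud import bigquery

    client = bigquery.Client()

    client.query(
        """
        CREATE OR REPLACE MODEL `analytics.churn_model`
        OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
        SELECT tenure_months, monthly_spend, support_tickets, churned
        FROM `analytics.customer_features`
        """
    ).result()

    predictions = client.query(
        """
        SELECT *
        FROM ML.PREDICT(
            MODEL `analytics.churn_model`,
            TABLE `analytics.customer_features`)
        """
    ).result()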

AutoML: Vertex AI’s AutoML is a data science product for automating key features of the data science life cycle, such as parameter tuning and feature selection. AutoML models can be connected directly to tables in BigQuery or datasets in a GCS data lake, once again eliminating the need for complex data networking. We have found that AutoML’s lack of explainability makes it a black box, so we prefer to use it in an exploratory capacity, complementing BQML rather than replacing it.
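For reference, kicking off an AutoML tabular training job against a BigQuery table with the Vertex AI SDK looks roughly like the sketch below; the project, dataset, and column names are hypothetical.

    # Sketch: train an AutoML tabular classifier on a BigQuery table with the
    # Vertex AI Python SDK. Project and data names are hypothetical.
    from google.cloud import aiplatform

    aiplatform.init(project="example-project", location="us-central1")

    dataset = aiplatform.TabularDataset.create(
        display_name="churn-features",
        bq_source="bq://example-project.analytics.customer_features",
    )

    job = aiplatform.AutoMLTabularTrainingJob(
        display_name="churn-automl",
        optimization_prediction_type="classification",
    )

    model = job.run(
        dataset=dataset,
        target_column="churned",
        budget_milli_node_hours=1000,  # one node-hour training budget
    )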

Model Deployment

Vertex AI: GCP’s Vertex AI offers multiple tools useful for deploying models and serving predictions. Vertex AI’s Model Registry acts as an artifact repository for storing, versioning, and evaluating ML models. Managed datasets provide tooling for reusable feature set generation from sources like BigQuery and Google Cloud Storage. Finally, Vertex AI model endpoints enable online prediction serving through a managed service. We find Vertex AI to be an excellent fit for customers looking to scale and optimize their MLOps platform.
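A typical register-deploy-predict flow looks roughly like this sketch; the artifact path, serving container image, and feature values are hypothetical.

    # Sketch: register a trained model in the Vertex AI Model Registry,
    # deploy it to an endpoint, and request an online prediction.
    # Artifact URI, container image, and instances are hypothetical.
    from google.cloud import aiplatform

    aiplatform.init(project="example-project", location="us-central1")

    model = aiplatform.Model.upload(
        display_name="churn-model",
        artifact_uri="gs://example-models/churn/v1",
        serving_container_image_uri=(
            "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-3:latest"
        ),
    )

    endpoint = model.deploy(machine_type="n1-standard-2")

    # Online prediction: one instance per feature vector.
    prediction = endpoint.predict(instances=[[24, 79.5, 2]])
    print(prediction.predictions)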

AI Engineering

Functions

Artificial Intelligence (AI) Engineering is the newest of our four verticals as a field, and as such is the least clearly defined. AI Ops encompasses the design and development of tools and systems for deploying AI applications at scale. This includes LLM and GenAI model hosting and deployment, prompt engineering, Retrieval-Augmented Generation (RAG) development, AI model and product evaluation, and data transformation for AI use cases, such as vectorization.

AI engineering is at the core of what we at Semantiks do as a company. As such, we limit the solutions we offer through Professional Services to avoid competing with our own proprietary technologies. We do, however, offer support for select AI engineering use cases.

Tools

Vector Search

Firestore: Cloud Firestore, the Firebase NoSQL database, offers a native data type for vector embeddings. We have implemented patterns for automatically adding embeddings to Firestore collections to enable vector search algorithms. This gives existing Firestore customers an option for performing RAG without deploying additional vector database infrastructure.
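The sketch below shows the core idea, assuming google-cloud-firestore 2.16+ and hypothetical collection and field names; in practice the embeddings would come from an embedding model, and the collection needs a vector index before nearest-neighbor queries will run.

    # Sketch: store an embedding on a Firestore document and run a
    # nearest-neighbor query against it. Names and vectors are hypothetical,
    # and the collection must already have a vector index.
    from google.cloud import firestore
    from google.cloud.firestore_v1.vector import Vector
    from google.cloud.firestore_v1.base_vector_query import DistanceMeasure

    db = firestore.Client()
    articles = db.collection("support_articles")

    # Write a document with a vector field (a real embedding would have
    # hundreds of dimensions; this 4-dim vector is illustrative only).
    articles.add({
        "title": "Reset your password",
        "embedding": Vector([0.1, 0.3, 0.7, 0.2]),
    })

    # Retrieve the closest documents to a query embedding as RAG context.
    neighbors = articles.find_nearest(
        vector_field="embedding",
        query_vector=Vector([0.1, 0.2, 0.6, 0.3]),
        distance_measure=DistanceMeasure.COSINE,
        limit=3,
    ).get()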

AI Analytics

LangSmith: For existing LangSmith customers, we have implemented end-to-end solutions for ingesting trace data, modeling it in the warehouse, and visualizing it via Looker.
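The ingestion step can be as simple as paging through runs with the LangSmith client and landing them in the warehouse; the sketch below assumes hypothetical project and table names and captures only a handful of the available trace fields.

    # Sketch: pull root trace runs from LangSmith and land them in BigQuery
    # for downstream modeling and Looker dashboards. Names are hypothetical.
    from google.cloud import bigquery
    from langsmith import Client

    ls_client = Client()  # reads its API key from the environment
    bq_client = bigquery.Client()

    rows = [
        {
            "run_id": str(run.id),
            "name": run.name,
            "start_time": run.start_time.isoformat(),
            "error": run.error,
        }
        for run in ls_client.list_runs(project_name="example-agent", is_root=True)
    ]

    bq_client.insert_rows_json("analytics_raw.langsmith_runs", rows)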