Our Four Pillars of Data Engineering

Specialization is a natural process that occurs as industries progress and mature. Software Engineering emerged as a field in the 1960s, but Data Engineering as a distinct discipline did not appear until the 21st century, even though SQL and relational databases have been around since the 1970s.

Modern, well-funded data teams now often employ unique combinations of data engineers, analytics engineers, BI engineers, data analysts, business analysts, quantitative analysts, data scientists, ML engineers, AI engineers, data architects, solutions architects, analytics managers, data product managers, and so on. Each of these roles requires a different degree of specialization, which often depends on the context of the team, organization, or industry. The nuances of these roles can be confusing for less mature data teams, especially in talent markets that have not developed that level of specialization.

At Semantiks Pro Services, we have simplified this landscape into four key verticals:

  • Data Engineering

  • Analytics Engineering

  • Machine Learning (ML) Engineering

  • Artificial Intelligence (AI) Engineering


These are our Four Pillars of Data Engineering, and we will use this post to discuss the services we offer within each vertical and the tools we use to accomplish these tasks.

Data Engineering

Functions

Simply put, data engineering is the movement of data and the deployment and maintenance of the architecture needed to support it. Our data engineering services often take the form of developing data pipelines and deploying the tools and services upon which those pipelines run. These projects usually involve migrating data from operational databases such as Postgres or MySQL, NoSQL databases like MongoDB, or third-party software providers such as HubSpot or Zendesk into a data warehouse such as Google BigQuery. These processes are referred to as ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform), depending on the order of operations.
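
To make the ELT ordering concrete, here is a minimal sketch using the google-cloud-bigquery Python client: the raw data is loaded first, and the transformation happens afterwards inside the warehouse. The bucket, dataset, and table names are hypothetical placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Extract + Load: land the raw source export in the warehouse as-is.
load_job = client.load_table_from_uri(
    "gs://example-bucket/exports/orders.json",  # hypothetical export location
    "example_project.raw.orders",
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
    ),
)
load_job.result()  # wait for the load to finish

# Transform: the "T" happens last, as SQL inside the warehouse.
client.query(
    """
    CREATE OR REPLACE TABLE example_project.analytics.orders_clean AS
    SELECT order_id, customer_id, CAST(amount AS NUMERIC) AS amount
    FROM example_project.raw.orders
    WHERE order_id IS NOT NULL
    """
).result()
```

In an ETL pipeline, by contrast, the casting and filtering would run before the data ever reaches the warehouse.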


Managing access to data products and integrations is another aspect of data engineering. This can take the form of an integration with a BI tool or the delivery of Data-as-a-Service (DaaS) products to customers. Semantiks specializes in designing and implementing these customer-facing DaaS products to help our clients unlock new revenue streams.

Tools

  1. Data Movement

Fivetran: Fivetran is a managed service that offers flexible and customizable data pipelines from hundreds of sources to your data warehouse, without requiring significant infrastructure investment. It is an ideal solution for early-stage startups that do not have the resources to maintain and grow an engineering platform. Fivetran offers a 14-day free trial on every new connection, making it a low-risk option, and it also offers a free tier for low-volume users.

Airbyte: Airbyte is a competitor to Fivetran that offers a very similar service model. Its key advantages over Fivetran are an open-source option, which allows free deployment on self-hosted hardware, and a greater variety of data sink destinations. We find Fivetran to be slightly more reliable than Airbyte, but we have seen many satisfied Airbyte customers and have successfully worked with both.

GCP Datastream: Datastream is a managed Change Data Capture (CDC) tool offered by Google to copy data from operational databases to your BigQuery data warehouse in near real time. It is a great option for existing Google Cloud customers who do not want to pay for external services like Fivetran or Airbyte, but who would still benefit from a serverless, low-maintenance deployment option. We have found Datastream's database and cloud networking administration requirements to be more complex than Fivetran's or Airbyte's, but we have a lot of experience managing them.

Azure Data Factory: Data Factory provides a nice graphical interface for designing data pipelines in Microsoft Azure. These pipelines can include automated CDC, as well as bulk copies from a wider variety of sources than Datastream offers (although GCP’s Dataflow and Data Fusion are better analogs to Data Factory). In our experience, developing, maintaining, and debugging Data Factory pipelines requires significant expertise and can be challenging for early-stage startups.

BigQuery Data Transfer Service: In a pinch, the BigQuery Data Transfer Service can be an effective tool for multi-cloud customers who have operational data in AWS or Azure. Database snapshots can be copied into BigQuery from cloud storage buckets. This is a last-resort option for early-stage customers who prefer not to use third-party managed services, but who do not have the technical expertise to maintain a CDC tool like Datastream.
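
As a sketch of this pattern, the google-cloud-bigquery-datatransfer client below schedules a recurring copy from a storage bucket into BigQuery; the project, bucket path, table name, and schedule are hypothetical placeholders.

```python
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

# A recurring transfer that loads CSV snapshots from a bucket into one table.
transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="raw",
    display_name="nightly-snapshot-load",
    data_source_id="google_cloud_storage",
    params={
        "data_path_template": "gs://example-bucket/snapshots/*.csv",
        "destination_table_name_template": "orders_snapshot",
        "file_format": "CSV",
        "skip_leading_rows": "1",
    },
    schedule="every 24 hours",
)

client.create_transfer_config(
    parent=client.common_project_path("example-project"),
    transfer_config=transfer_config,
)
```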

Cloud Run Functions: We have used Cloud Run functions (formerly Cloud Functions) for both data ingestion and data delivery use cases. They are an excellent option for any workflow that can be deployed within a Function-as-a-Service framework, such as retrieving and returning data. This makes them flexible enough to support ETL use cases as well as serving data to customers in a DaaS model.
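
Here is a minimal sketch of the DaaS pattern using the functions-framework library and the BigQuery client: an HTTP function that accepts a request, queries the warehouse, and returns JSON. The dataset, table, and parameter names are hypothetical.

```python
import functions_framework
from google.cloud import bigquery

client = bigquery.Client()

@functions_framework.http
def get_daily_metrics(request):
    # Read a filter from the request (hypothetical customer_id parameter).
    customer_id = request.args.get("customer_id")

    job = client.query(
        """
        SELECT metric_date, metric_value
        FROM `example_project.analytics.daily_metrics`
        WHERE customer_id = @customer_id
        ORDER BY metric_date DESC
        LIMIT 30
        """,
        job_config=bigquery.QueryJobConfig(
            query_parameters=[
                bigquery.ScalarQueryParameter("customer_id", "STRING", customer_id)
            ]
        ),
    )

    # Return query results as JSON to the calling application.
    return {"results": [dict(row) for row in job.result()]}
```

The same function shape works for ingestion: swap the query for a load job triggered by the incoming payload.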


Looker Embedded: Looker Embedded is a framework and library for serving Looker content directly to end users via your website or application. It is a straightforward and effective way to monetize data by delivering insights directly to customers. We provide demos and code examples for embedding Looker, including key considerations such as authentication, data privacy, and generating value for consumers.
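
For instance, a signed embed URL can be generated server-side with the official looker_sdk package (API 4.0); the dashboard path, user ID, permissions, and model names below are hypothetical, and credentials are read from a looker.ini file or environment variables.

```python
import looker_sdk
from looker_sdk import models40 as models

sdk = looker_sdk.init40()  # credentials from looker.ini or environment

# Create a signed URL that authenticates an external user into one dashboard.
sso_url = sdk.create_sso_embed_url(
    body=models.EmbedSsoParams(
        target_url="https://example.cloud.looker.com/embed/dashboards/42",
        external_user_id="customer-123",
        session_length=3600,
        permissions=["access_data", "see_looks", "see_user_dashboards"],
        models=["example_model"],
    )
)

# Render this URL in an iframe in your application.
print(sso_url.url)
```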

  2. Orchestration

Airflow: Apache Airflow is our preferred tool for automating and managing coordination between various source systems (i.e., orchestration). We have developed patterns for easily deploying self-hosted Airflow instances on cloud hardware, and we have significant experience developing pipelines as Airflow DAGs or migrating existing pipelines to them. In our opinion, the self-hosted Airflow option is best suited for SMEs that have the resources to maintain cloud infrastructure but lack the budget for a more expensive managed implementation, such as Google Cloud Composer. Larger, better-funded data teams have the flexibility to choose between the self-hosted and managed routes.
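
A minimal DAG makes the orchestration idea concrete: declare tasks, then declare the order they run in. The task bodies and schedule below are hypothetical placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from the source system")

def load():
    print("copy raw data into the warehouse")

def transform():
    print("run in-warehouse SQL transformations")

with DAG(
    dag_id="example_elt_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
):
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_load = PythonOperator(task_id="load", python_callable=load)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)

    # Dependencies: extract must finish before load, load before transform.
    t_extract >> t_load >> t_transform
```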

Analytics Engineering

Functions

Analytics engineering as a discipline focuses on data modeling and development in the warehouse. Analytics engineers are responsible for everything that happens after data has been moved. This can take the form of data transformation, business intelligence development, or designing data marts and other data products. It is all the work that needs to happen between the data engineering pipeline and the user-facing dashboard. Building flexible and scalable data models is both an art and a science, and it is a skill we at Semantiks pride ourselves on.

Tools

  1. Business Intelligence

Looker: With over seven years of Looker experience, we consider ourselves experts in developing, scaling, and monetizing data products in Looker. Our clients have a wide range of Looker experience, and our projects have ranged from initial Looker instance setup to feature enhancements and optimization. We excel at every phase of Looker use, from backend development to project modeling to business intelligence and dashboarding.

  2. Data Transformation

dbt: Data Build Tool (dbt) is one of the earliest and most popular tools for analytics engineering workloads. It allows developers to augment basic SQL with human-readable templates and deploy queries as pipelines. We like dbt because SQL is a simple, well-known language for data transformation with low technical barriers to entry. The human resources required to develop pipelines in dbt are much cheaper than those required to build complex pipelines in Spark or Python, for example, making it a great option for early-stage startups. dbt is open-source and can be deployed for free on self-hosted hardware, or as a managed service on dbt Cloud.
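
The appeal is easiest to see in a model file: a dbt model is plain SQL plus Jinja templating, and dbt infers run order from ref() calls. As a sketch, dbt-core (1.5+) can also be invoked programmatically from Python; the model name below is hypothetical.

```python
# A dbt model is a SQL file with Jinja templating. For example,
# models/orders_clean.sql might contain:
#
#   SELECT order_id, customer_id, amount
#   FROM {{ ref('raw_orders') }}
#   WHERE order_id IS NOT NULL
#
# {{ ref(...) }} lets dbt infer the dependency graph and run order.

from dbt.cli.main import dbtRunner

dbt = dbtRunner()
result = dbt.invoke(["run", "--select", "orders_clean"])
if not result.success:
    raise RuntimeError("dbt run failed")
```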


Fivetran Transformations: Transformations is another Fivetran feature that can be used to automate analytics engineering workloads on data that Fivetran moves. There are pre-built transformations for a variety of common software providers, but Fivetran can also integrate with a dbt project, allowing SQL transformations to be triggered automatically after Fivetran copies data from the source. This synergy between Fivetran and dbt is a great option for early data teams that do not have the engineering resources to maintain pipeline infrastructure.


Looker Derived Tables: Derived tables are a key component of Looker's value proposition, allowing developers to deploy data transformations at read time (i.e., when data is consumed by end-users) or directly into the warehouse via Persistent Derived Tables. Like dbt, derived tables are another managed option for deploying transformations via SQL and human-readable configuration. They do not scale as well as dbt projects, especially with complex table dependencies, but we often incorporate derived tables as part of an initial Looker setup to minimize time to value.

Machine Learning (ML) Engineering

Functions

Machine Learning Operations, or MLOps, includes everything needed to support machine learning outside of model training and testing. ML engineers are responsible for feature engineering, model deployment, model evaluation, retraining and automation, and generating predictions. Some organizations also rely on ML engineers to optimize model efficiency and performance. Our ML engineering projects usually involve hosting and deploying models for online prediction, feature engineering solutions, and/or batch prediction pipelines.

Tools

  1. Model Training

BigQuery ML: BQML is a framework for writing and deploying machine learning models via SQL code. These models live in BigQuery alongside your data, although execution often happens via Google's Vertex AI. We like BQML as a solution for training models because the models live in the same system as the data used to train them, meaning there are no complicated data integration requirements. In addition, BQML provides a standard, easy-to-use syntax that contrasts with the many different ML libraries available in Python and other programming languages.
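
Because BQML models are just SQL objects, training and prediction can both be issued as ordinary queries. Below is a minimal sketch via the BigQuery Python client; the dataset, table, and column names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Train: CREATE MODEL is a plain SQL statement run in the warehouse.
client.query(
    """
    CREATE OR REPLACE MODEL `example_project.analytics.churn_model`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
    SELECT tenure_months, monthly_spend, support_tickets, churned
    FROM `example_project.analytics.customer_features`
    """
).result()

# Predict: ML.PREDICT returns predictions as ordinary query results.
rows = client.query(
    """
    SELECT customer_id, predicted_churned
    FROM ML.PREDICT(
        MODEL `example_project.analytics.churn_model`,
        TABLE `example_project.analytics.customer_features`
    )
    """
).result()
```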

AutoML: Vertex AI's AutoML is a data science product that automates key steps of the data science lifecycle, such as parameter tuning and feature selection. AutoML models can connect directly to tables in BigQuery or datasets in a GCS data lake, once again eliminating the need for complex data networking. We have found that AutoML's lack of explainability makes it something of a black box, so we prefer to use it as an exploratory complement to BQML rather than a replacement.
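
For comparison, here is a sketch of kicking off an AutoML tabular job directly against a BigQuery table with the google-cloud-aiplatform SDK; the project, region, table, and column names are hypothetical.

```python
from google.cloud import aiplatform

aiplatform.init(project="example-project", location="us-central1")

# Point a managed dataset directly at the BigQuery table (no data movement).
dataset = aiplatform.TabularDataset.create(
    display_name="churn-features",
    bq_source="bq://example-project.analytics.customer_features",
)

# Let AutoML handle feature selection and parameter tuning.
job = aiplatform.AutoMLTabularTrainingJob(
    display_name="churn-automl",
    optimization_prediction_type="classification",
)

model = job.run(
    dataset=dataset,
    target_column="churned",
    budget_milli_node_hours=1000,  # one node hour
)
```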

  2. Model Deployment

Vertex AI: GCP's Vertex AI offers multiple useful tools for deploying models and serving predictions. Vertex AI's Model Registry acts as an artifact repository for storing, versioning, and evaluating ML models. Managed datasets provide tools for generating reusable feature sets from sources such as BigQuery and Google Cloud Storage. Finally, Vertex AI's model endpoints allow serving online predictions via a managed service. We consider Vertex AI to be an excellent option for customers looking to scale and optimize their MLOps platform.
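
The deployment path looks roughly like the sketch below with the google-cloud-aiplatform SDK: register the model, deploy it to an endpoint, and request an online prediction. The artifact location, serving container, and instance payload are hypothetical.

```python
from google.cloud import aiplatform

aiplatform.init(project="example-project", location="us-central1")

# Register the trained artifact in the Vertex AI Model Registry.
model = aiplatform.Model.upload(
    display_name="churn-model",
    artifact_uri="gs://example-bucket/models/churn/",  # hypothetical artifact
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-3:latest"
    ),
)

# Deploy the model to a managed endpoint for online predictions.
endpoint = model.deploy(machine_type="n1-standard-2")

# Serve a prediction (feature order matches the training data).
prediction = endpoint.predict(instances=[[24, 79.5, 2]])
print(prediction.predictions)
```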

Artificial Intelligence (AI) Engineering

Functions

Artificial Intelligence (AI) Engineering is the newest of our four verticals as a field, and as such is the least clearly defined. AI Ops includes the development and design of tools and systems to deploy AI applications at scale. This includes hosting and deploying LLM and GenAI models, prompt engineering, RAG development, evaluating AI models and products, and data transformation for AI use cases, such as vectorization.

AI engineering is at the heart of what we do at Semantiks as a company. As such, we are limited in the solutions we offer via Professional Services to avoid creating competition with our own proprietary technologies. However, we do offer limited support for some AI engineering use cases.

Tools

  1. Vector Search

Firestore: Cloud Firestore, Firebase's NoSQL database, offers a native data type for vector embeddings. We have implemented patterns to automatically add embeddings to Firestore collections to enable vector search algorithms. This offers existing Firestore customers an option to perform Retrieval-Augmented Generation (RAG) without deploying additional vector database infrastructure.
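
A minimal sketch of the pattern, assuming google-cloud-firestore 2.14+ and a vector index on the collection; the collection name, field names, and embedding values are hypothetical, and real embeddings would come from an embedding model.

```python
from google.cloud import firestore
from google.cloud.firestore_v1.vector import Vector
from google.cloud.firestore_v1.base_vector_query import DistanceMeasure

db = firestore.Client()
docs = db.collection("documents")

# Store an embedding alongside each document using the native vector type.
docs.add({"text": "example passage", "embedding": Vector([0.10, 0.20, 0.30])})

# RAG retrieval step: fetch the nearest neighbors for a query embedding.
results = docs.find_nearest(
    vector_field="embedding",
    query_vector=Vector([0.09, 0.21, 0.33]),
    distance_measure=DistanceMeasure.COSINE,
    limit=5,
).get()

for doc in results:
    print(doc.id, doc.to_dict()["text"])
```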

  2. AI Analytics

LangSmith: For existing LangSmith customers, we have implemented end-to-end solutions for trace data ingestion, in-warehouse modeling, and visualization via Looker.
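
As a sketch of the ingestion step, the snippet below pulls recent runs with the langsmith Python client and lands them in BigQuery for downstream modeling; the LangSmith project name and destination table are hypothetical.

```python
from itertools import islice

from langsmith import Client
from google.cloud import bigquery

ls_client = Client()  # reads the LangSmith API key from the environment
bq_client = bigquery.Client()

# Pull a batch of recent runs (traces) from a hypothetical LangSmith project.
rows = []
for run in islice(ls_client.list_runs(project_name="example-project"), 100):
    rows.append(
        {
            "run_id": str(run.id),
            "name": run.name,
            "run_type": run.run_type,
            "start_time": run.start_time.isoformat() if run.start_time else None,
            "error": run.error,
        }
    )

# Land the raw traces in the warehouse; in-warehouse modeling happens downstream.
errors = bq_client.insert_rows_json(
    "example_project.ai_analytics.langsmith_runs", rows
)
if errors:
    raise RuntimeError(f"BigQuery insert errors: {errors}")
```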