Use Case - Sicar



Conquering Unstructured Sales Data to Deliver 90% Performance Gains with BigQuery and Fivetran


Industry
Fintech


Product
Point of Sale


Problem
Slow Dashboards


Data Product
Monetize Your Data
Case Study
Unstructured data is a controversial subject here at Semantiks Professional Services. In the words of our Principal Consultant, Jake Klein, “NoSQL was the next big thing 10 years ago. The freedom of semi-structured data seemed great until you needed to perform analytics on data with no standards or consistency. Just use Postgres with JSON data types.”
To be clear, there are many benefits to NoSQL databases and unstructured data. For our client Sicar, the leading point of sale in Latin America, MongoDB’s flexible schema enables them to easily absorb product catalogs and transactional data from their hundreds of customers. The challenges that led them to Semantiks were not with the structure of their data but rather its performance.
Analytics are a key feature of Sicar X, their cloud-based POS offering. Sicar X offers actionable reports and dashboards, empowering customers to make better inventory, sales, and staffing decisions. These reports, however, faced performance issues, sometimes taking several minutes to load. As a workaround, some reports were only available for the last few months of data. Both the slowness and the limited reporting window were consistent customer pain points.
4 weeks
Project Completion Time
4
# of Data Sources
>90%
Reduction in query execution time
<10 Minutes
Data Freshness
The Problem
There were two root causes of these performance issues: infrastructure and tooling. These may sound like the same thing, but the distinction was important for solving Sicar’s challenges. First, data lived in four different sources: document databases MongoDB and DynamoDB, relational data in MySQL, and reporting tables in Google BigQuery. From a product or engineering perspective, there are perfectly rational reasons to leverage so many data sources. However, this often forced Sicar’s reporting platform to query data from multiple sources and join it in the front end, which performs poorly.
Additionally, MongoDB struggled to perform the join, unnesting, and deduplication logic that Sicar X transactional data required. A solution was to perform queries in chunks, stitching results together at the end. This made large queries tractable, but still caused significant performance issues.
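To make the workload concrete, here is a minimal sketch of the kind of unnest-and-deduplicate logic described above, written in plain Python over sample documents. The field names (`products`, `sku`, `qty`, `updated_at`) are hypothetical and not Sicar’s actual schema; this is exactly the class of transformation a columnar warehouse handles well and a document store handles poorly at scale.

```python
# Sketch of the unnest + deduplicate step, with hypothetical field names.

def unnest_and_dedupe(transactions):
    """Flatten each transaction's product array into one row per product,
    then keep only the most recent version of each (txn, sku) pair."""
    rows = []
    for txn in transactions:
        for product in txn.get("products", []):  # unnest the array
            rows.append({
                "txn_id": txn["_id"],
                "updated_at": txn["updated_at"],
                "sku": product["sku"],
                "qty": product["qty"],
            })
    # Deduplicate: keep the latest record per (txn_id, sku)
    latest = {}
    for row in rows:
        key = (row["txn_id"], row["sku"])
        if key not in latest or row["updated_at"] > latest[key]["updated_at"]:
            latest[key] = row
    return list(latest.values())

sample = [
    {"_id": "t1", "updated_at": 1, "products": [{"sku": "a", "qty": 2}]},
    {"_id": "t1", "updated_at": 2, "products": [{"sku": "a", "qty": 3}]},
]
result = unnest_and_dedupe(sample)  # one row: the updated_at=2 version
```

In SQL, the same shape is an `UNNEST` followed by a window-function dedup; doing it document-by-document in an aggregation pipeline is what forced the chunked-query workaround.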
To summarize, we had two problems: disparate data sources and tools unequipped for our specific reporting challenges. Consolidating our data into a single source (e.g., MongoDB) would not have been sufficient on its own, nor would optimizing our MongoDB queries or clusters. Our solution needed to address both the infrastructure and tooling causes.
The Solution
“Storage is cheap and getting cheaper,” is another of Klein’s mantras. Redundant data, in other words, is a reasonable cost to pay for performant systems. Specifically, our solution was to create a single source of truth for analytics, consolidating data from all systems without altering the transactional data. This is by no means a novel approach: many modern organizations run both operational databases and a data warehouse, leveraging each tool where it performs best.
Our solution had three phases:



1.
Consolidate reporting from all sources into our data warehouse of choice: Google Cloud BigQuery,



2.
Transform and model the data not only for Sicar X’s current reporting needs but also with flexibility for future analytics



3.
Stage data in reporting formats to minimize the amount of logic performed at query time
Our Architecture



To recap, our data sources are:
A MongoDB Atlas cluster, hosted in Google Cloud,
DynamoDB, hosted in AWS,
A MySQL database, hosted in GCP Cloud SQL, and
Miscellaneous data already available in BigQuery
And we are using Google BigQuery as our data warehousing solution. BigQuery, fortunately, can read data directly from Cloud SQL. This left us only in need of solutions for MongoDB and DynamoDB. For us, the choice was obvious: Fivetran.
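For the Cloud SQL side, BigQuery’s `EXTERNAL_QUERY` function lets a single statement push a query down to MySQL and join the result against warehouse tables. The sketch below just builds the query string so the shape is visible; the connection ID and table names are hypothetical, not Sicar’s actual resources.

```python
# Illustrative only: hypothetical connection ID and table names, showing
# the shape of a BigQuery federated query against Cloud SQL (MySQL).

def federated_lookup_sql(connection_id, mysql_query):
    """Wrap a MySQL query in BigQuery's EXTERNAL_QUERY so its result can
    be joined against warehouse tables in one statement."""
    return (
        "SELECT *\n"
        f"FROM EXTERNAL_QUERY('{connection_id}', '{mysql_query}')"
    )

sql = federated_lookup_sql(
    "my-project.us.cloudsql-conn",                  # hypothetical connection
    "SELECT store_id, store_name FROM stores",      # runs inside MySQL
)
print(sql)
```

Because the inner query executes in MySQL and only its result set crosses into BigQuery, small lookup tables can stay in Cloud SQL with no replication pipeline at all.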
For MongoDB, the absolute barebones implementation we could deliver was quoted at $2,000, not counting cloud consumption costs. Based on Sicar’s data volumes (7.5 million monthly transactions), we estimated Fivetran consumption to cost $1,000–$1,500 per month. With a two-week free trial for all Fivetran connectors, this gave us a six-week window of continuous data flow during which we could develop our data model, test the performance gains, and start planning the platform migration work to leverage our new analytics layer. Sicar could spend that time implementing their own changes rather than onboarding to a complex MongoDB-to-BigQuery pipeline that they lacked the resources to maintain. From a cost and labor perspective, it was significantly more efficient to leverage Fivetran for our implementation and develop a self-hosted pipeline after the customer pain points from poor reporting performance were addressed.
The Implementation
Very briefly, because our approach to modeling should be its own post, our data model was as follows:
Raw data is written from MongoDB and DynamoDB via Fivetran into our staging layer (L1)
All of the necessary transformations, such as unnesting product arrays, deduplicating records, and extracting JSON keys, are performed in normalized analytics tables (L2)
Reporting tables are created with schemas mirroring Sicar X reports, as needed (L3)
Lookup Tables from Cloud SQL are accessed via federated queries
To accomplish this, we converted 1000+ line MongoDB aggregation pipelines into the SQL queries that would become our L3 tables, and then worked backwards to create our L2 normalized tables in between the raw data and our final outputs. After executing several test cases to ensure parity with current Sicar X reporting, we performed cost, freshness and performance benchmarking on various implementation strategies to give Sicar the best possible solution for their needs and budget.
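The pipeline-to-SQL conversion above can be illustrated with a toy translator for a single `$group` stage. The real pipelines ran 1000+ lines with many stage types; this sketch handles only the simplest case, and the table and field names are hypothetical.

```python
# Toy illustration of translating one MongoDB $group stage into an
# equivalent GROUP BY query. Hypothetical names; real pipelines were
# far larger and were converted by hand.

def group_stage_to_sql(stage, table):
    """Translate {'$group': {'_id': '$key', alias: {'$sum': '$field'}}}
    into SELECT key, SUM(field) AS alias FROM table GROUP BY key."""
    spec = stage["$group"]
    key = spec["_id"].lstrip("$")
    selects = [key]
    for alias, expr in spec.items():
        if alias == "_id":
            continue
        op, field = next(iter(expr.items()))  # e.g. {'$sum': '$qty'}
        fn = {"$sum": "SUM", "$avg": "AVG", "$max": "MAX"}[op]
        selects.append(f"{fn}({field.lstrip('$')}) AS {alias}")
    return f"SELECT {', '.join(selects)} FROM {table} GROUP BY {key}"

stage = {"$group": {"_id": "$sku", "total_qty": {"$sum": "$qty"}}}
sql = group_stage_to_sql(stage, "l2_sales_items")
# SELECT sku, SUM(qty) AS total_qty FROM l2_sales_items GROUP BY sku
```

Working backwards the same way, each L3 reporting query dictated which columns the L2 tables needed to materialize from the raw L1 documents.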