Use Case - Sicar



Conquering Unstructured Sales Data to Deliver 90% Performance Gains with BigQuery and Fivetran


Industry
Fintech


Product
Point of Sale


Problem
Slow Dashboards


Data Product
Monetize Your Data
Case Study
Unstructured data is a controversial subject here at Semantiks Professional Services. In the words of our Principal Consultant, Jake Klein, “NoSQL was the next big thing 10 years ago. The freedom of semi-structured data seemed great until you needed to perform analytics on data with no standards or consistency. Just use Postgres with JSON data types.”
To be clear, there are many benefits to NoSQL databases and unstructured data. For our client Sicar, the leading point of sale in Latin America, MongoDB’s flexible schema enables them to easily absorb product catalogs and transactional data from their hundreds of customers. The challenges that led them to Semantiks were not with the structure of their data but rather its performance.
Analytics are a key feature of Sicar X, their cloud-based POS offering. Sicar X offers actionable reports and dashboards, empowering customers to make better inventory, sales, and staffing decisions. These reports, however, faced performance issues, sometimes taking several minutes to load. As a workaround, some reports were only available for the last few months of data. Both the slowness and the limited reporting window were consistent customer pain points.
4 weeks
Project Completion Time
4
# of Data Sources
>90%
Reduction in query execution time
<10 Minutes
Data Freshness
The Problem
There were two root causes of these performance issues: infrastructure and tooling. These may sound like the same thing, but the distinction was important for solving Sicar’s challenges. First, data lived in four different sources: document databases MongoDB and DynamoDB, relational data in MySQL, and reporting tables in Google BigQuery. From a product or engineering perspective, there are perfectly rational reasons to leverage so many data sources. However, this often forced Sicar’s reporting platform to query data from multiple sources and join it in the front end, which performs poorly.
Additionally, MongoDB struggled to perform the join, unnesting, and deduplication logic that Sicar X transactional data required. A solution was to perform queries in chunks, stitching results together at the end. This made large queries tractable, but still caused significant performance issues.
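To make the workload concrete, here is a minimal sketch of the kind of unnest-and-deduplicate logic described above, written in plain Python over sample documents. The field names (`products`, `sku`, `qty`, `updated_at`) are hypothetical and not Sicar’s actual schema; this is exactly the class of transformation a columnar warehouse handles well and a document store handles poorly at scale.

```python
# Sketch of the unnest + deduplicate step, with hypothetical field names.

def unnest_and_dedupe(transactions):
    """Flatten each transaction's product array into one row per product,
    then keep only the most recent version of each (txn, sku) pair."""
    rows = []
    for txn in transactions:
        for product in txn.get("products", []):  # unnest the array
            rows.append({
                "txn_id": txn["_id"],
                "updated_at": txn["updated_at"],
                "sku": product["sku"],
                "qty": product["qty"],
            })
    # Deduplicate: keep the latest record per (txn_id, sku)
    latest = {}
    for row in rows:
        key = (row["txn_id"], row["sku"])
        if key not in latest or row["updated_at"] > latest[key]["updated_at"]:
            latest[key] = row
    return list(latest.values())

sample = [
    {"_id": "t1", "updated_at": 1, "products": [{"sku": "a", "qty": 2}]},
    {"_id": "t1", "updated_at": 2, "products": [{"sku": "a", "qty": 3}]},
]
result = unnest_and_dedupe(sample)  # one row: the updated_at=2 version
```

In SQL, the same shape is an `UNNEST` followed by a window-function dedup; doing it document-by-document in an aggregation pipeline is what forced the chunked-query workaround.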
To summarize, we had two problems: disparate data sources and tools unequipped for our specific reporting challenges. Consolidating our data into a single source (e.g., MongoDB) would not have been sufficient on its own, nor would optimizing our MongoDB queries or clusters. Our solution needed to address both the infrastructure and tooling causes.
The Solution
“Storage is cheap and getting cheaper,” is another of Klein’s mantras. Redundant data, in other words, is a reasonable cost to pay for performant systems. Specifically, our solution was to create a single source of truth for analytics, consolidating data from all systems without altering the transactional data. This is by no means a novel approach: many modern organizations run both operational databases and a data warehouse, leveraging each tool where it performs best.
Our solution had three phases:



1.
Consolidate reporting from all sources into our data warehouse of choice: Google Cloud BigQuery,



2.
Transform and model the data not only for Sicar X’s current reporting needs but also with flexibility for future analytics



3.
Stage data in reporting formats to minimize the amount of logic performed at query time
Our Architecture



To recap, our data sources are:
A MongoDB Atlas cluster, hosted in Google Cloud,
DynamoDB, hosted in AWS,
A MySQL database, hosted in GCP Cloud SQL, and
Miscellaneous data already available in BigQuery
And we are using Google BigQuery as our data warehousing solution. BigQuery, fortunately, can read data directly from Cloud SQL. This left us only in need of solutions for MongoDB and DynamoDB. For us, the choice was obvious: Fivetran.
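For the Cloud SQL side, BigQuery’s `EXTERNAL_QUERY` function lets a single statement push a query down to MySQL and join the result against warehouse tables. The sketch below just builds the query string so the shape is visible; the connection ID and table names are hypothetical, not Sicar’s actual resources.

```python
# Illustrative only: hypothetical connection ID and table names, showing
# the shape of a BigQuery federated query against Cloud SQL (MySQL).

def federated_lookup_sql(connection_id, mysql_query):
    """Wrap a MySQL query in BigQuery's EXTERNAL_QUERY so its result can
    be joined against warehouse tables in one statement."""
    return (
        "SELECT *\n"
        f"FROM EXTERNAL_QUERY('{connection_id}', '{mysql_query}')"
    )

sql = federated_lookup_sql(
    "my-project.us.cloudsql-conn",                  # hypothetical connection
    "SELECT store_id, store_name FROM stores",      # runs inside MySQL
)
print(sql)
```

Because the inner query executes in MySQL and only its result set crosses into BigQuery, small lookup tables can stay in Cloud SQL with no replication pipeline at all.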
For MongoDB, the absolute barebones implementation we could deliver was quoted at $2,000, not counting cloud consumption costs. Based on Sicar’s data volumes (7.5 million monthly transactions), we estimated Fivetran consumption to cost $1,000–$1,500 per month. With a two-week free trial for all Fivetran connectors, this gave us a six-week window of continuous data flow during which we could develop our data model, test the performance gains, and start planning the platform migration work to leverage our new analytics layer. Sicar could spend that time implementing their own changes rather than onboarding to a complex MongoDB-to-BigQuery pipeline that they lacked the resources to maintain. From a cost and labor perspective, it was significantly more efficient to leverage Fivetran for our implementation and develop a self-hosted pipeline after the customer pain points from poor reporting performance were addressed.
The Implementation
Very briefly, because our approach to modeling should be its own post, our data model was as follows:
Raw data is written from MongoDB and DynamoDB via Fivetran into our staging layer (L1)
All of the necessary transformations, such as unnesting product arrays, deduplicating records, and extracting JSON keys, are performed in normalized analytics tables (L2)
Reporting tables are created with schemas mirroring Sicar X reports, as needed (L3)
Lookup Tables from Cloud SQL are accessed via federated queries
To accomplish this, we converted 1000+ line MongoDB aggregation pipelines into the SQL queries that would become our L3 tables, and then worked backwards to create our L2 normalized tables in between the raw data and our final outputs. After executing several test cases to ensure parity with current Sicar X reporting, we performed cost, freshness and performance benchmarking on various implementation strategies to give Sicar the best possible solution for their needs and budget.
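The pipeline-to-SQL conversion above can be illustrated with a toy translator for a single `$group` stage. The real pipelines ran 1000+ lines with many stage types; this sketch handles only the simplest case, and the table and field names are hypothetical.

```python
# Toy illustration of translating one MongoDB $group stage into an
# equivalent GROUP BY query. Hypothetical names; real pipelines were
# far larger and were converted by hand.

def group_stage_to_sql(stage, table):
    """Translate {'$group': {'_id': '$key', alias: {'$sum': '$field'}}}
    into SELECT key, SUM(field) AS alias FROM table GROUP BY key."""
    spec = stage["$group"]
    key = spec["_id"].lstrip("$")
    selects = [key]
    for alias, expr in spec.items():
        if alias == "_id":
            continue
        op, field = next(iter(expr.items()))  # e.g. {'$sum': '$qty'}
        fn = {"$sum": "SUM", "$avg": "AVG", "$max": "MAX"}[op]
        selects.append(f"{fn}({field.lstrip('$')}) AS {alias}")
    return f"SELECT {', '.join(selects)} FROM {table} GROUP BY {key}"

stage = {"$group": {"_id": "$sku", "total_qty": {"$sum": "$qty"}}}
sql = group_stage_to_sql(stage, "l2_sales_items")
# SELECT sku, SUM(qty) AS total_qty FROM l2_sales_items GROUP BY sku
```

Working backwards the same way, each L3 reporting query dictated which columns the L2 tables needed to materialize from the raw L1 documents.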