If you follow the technology press, you see daily talking points about the rise of artificial intelligence and how it will reshape industries and competitive landscapes. At Serverless Solutions we fundamentally believe that to be true, but we have clients that still have what you might consider rudimentary use cases for data and data consumption. Let’s call it reporting.

How you do reporting has changed in recent years. Traditionally you might have invested a lot of time developing a data warehouse or operational data store, then added reporting capabilities on top of it. Those are not bad investments even today, but most businesses and users will not wait months or years for that data to be available to them. As a result we have seen the rise of powerful visualization and data-consolidation tools like Tableau and Power BI that let power business users do things with data that were previously the domain of IT.
Google has a powerful answer to this new set of expectations and tools in the marketplace. I say powerful because there is power in simplicity. Google has done an amazing job of making its services simple and approachable, but behind them are amazing things happening that allow users to develop and deploy quickly, and to scale to mind-boggling proportions if needed.
I call these services from Google ‘serverless’. What we are really saying when we use that term is that Google manages provisioning and deployment for you at run-time, allowing you as a creator to focus on creating things rather than managing them.
Before we dive into specific services, let’s talk about infrastructure versus platform services. In the cloud world you hear about Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Software as a Service (SaaS) and, more recently, serverless or Function as a Service (FaaS).
The services we will discuss in this paper occupy the serverless and/or SaaS space. There are many advantages to this: there is nothing to provision or patch, the services scale automatically based on need, and you pay only for what you use.
Restating the above: the biggest assumption we are making is that this is a cloud-based, fully managed (aka serverless) infrastructure. I am explicitly stating this again because it differs from tools like Tableau that run on the desktop, or an open-source deployment of something like Apache Drill via a container.
Our end goal is to provide the following for our clients: a fast way to clean and prepare data, a fully managed place to store and query it at any scale, and a simple way to visualize and share the results.
To achieve those goals we will combine three simple services from Google Cloud Platform: Cloud DataPrep, BigQuery and Data Studio.

Let’s take a tour of each of the services above in the order we would put this design pattern to use.
My first interactions with Cloud DataPrep were frustrating, and I abandoned working with the service; initially, there were several things about it I did not like.
The reality is I did not fully appreciate the assumptions above: this is a cloud-managed service, and I was forcing an approach into the tool that did not make sense for it.
I revisited the service because a client gave me a requirement with a very aggressive timeline, and I needed some help. I did not have time to struggle through Python and Pandas data wrangling, so I forced myself to use Cloud DataPrep, and I am glad I did. After getting more comfortable with it, I found it to be what I have needed for a long time when it comes to data cleaning and preparation. This quote from the Google site captures my thoughts well:
“Cloud Dataprep automatically detects schemas, datatypes, possible joins, and anomalies such as missing values, outliers, and duplicates so you get to skip the time-consuming work of profiling your data and go right to the data analysis.”
So, it makes a guess at what I want. There are some tools I really dislike because they are so poor at making good recommendations, or are difficult to change once they make one. Cloud DataPrep was reasonably good at guessing what I wanted, and when it was wrong, it was easy to change and to see my changes side-by-side with what was previously suggested.
This video by Google does a great job of providing an overview of Cloud DataPrep.

Using the diagram above, we can walk through the highlights of using DataPrep by activity.
Data Visualization & Exploration: When you are working with data in Cloud DataPrep, the tool is constantly showing you distributions and things about your data. If I were wrangling data in Python, I might be doing histograms or using packages like missingno to ‘see’ what my data looks like and explicitly setting data types. That equates to lines of code and manipulating the code until it shows and does what I want. Cloud DataPrep makes very intelligent guesses about your data and shows it to you in real time.

In the graphic above, DataPrep is showing distributions for the data, and it has made some guesses at the type of each column and the format it should be in. Note the icon at the left of the Phone_Number column: DataPrep thinks this column is a phone number and has applied a data type and a data mask to it. It is also showing me that, in this sample (more on samples later), 19 of the records do not fit the data type or mask. If I do not like the guesses DataPrep made for me, they are simple to change with a few clicks into sub-menus. My changes become what it calls a ‘recipe’. A recipe represents all of my changes to the data and is what will later be compiled into a Cloud Dataflow job on my behalf.
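To make the mask check concrete, here is roughly the same validation done by hand in pandas. The file name and the exact phone format are assumptions.

import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical file

# Count records that do not fit a US phone mask like (555) 555-5555
fits_mask = df["Phone_Number"].astype(str).str.match(r"^\(\d{3}\) \d{3}-\d{4}$")
print((~fits_mask).sum(), "records do not fit the mask")

DataPrep runs this kind of check for every column, continuously, without you writing or maintaining any of it.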
Data Manipulation (Recipe Development):
DataPrep has a powerful set of actions you can execute on your data, and these actions work well on structured and unstructured data. The toolbar below shows the top-level functions, like merge or split, and each of these has a subset of functionality that supports it.

These actions turn into recipes and can be sequenced to accomplish complex data wrangling activities.
Samples:
To speed development, you can point Cloud DataPrep at a large set of data and it will intelligently sample it, so that you are building your recipes against a smaller data set. This allows for a responsive interface and quick prototyping.
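The do-it-yourself equivalent would be something like the pandas sketch below; the sample fraction is arbitrary.

import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical large file

# Prototype transformations against 1% of the data, reproducibly
sample = df.sample(frac=0.01, random_state=42)

The difference is that DataPrep chooses and refreshes the sample for you, and then runs the finished recipe against the full data set.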
Fully Managed (Run Job/Land Data):
This is the magic part: a fully managed service. When you are happy with your recipe and how your data should look, you run it as a job. You can set the job to run as a batch or based on an event (like a scheduled time or the appearance of a file). You also tell it where the data should go; in our pattern, we are going to land it in BigQuery, which we will discuss later. When you run the job, some powerful things happen. DataPrep compiles your recipe(s) into code, then deploys and manages that code on your behalf. This means your Cloud DataPrep job becomes a Cloud Dataflow job. We will not discuss Dataflow in depth here, but think of it as the service you would write code in to do ETL. Cloud DataPrep is generating this code and deployment for us.
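To give a feel for what is being generated on your behalf, here is a minimal hand-written sketch of a Dataflow-style pipeline using the Apache Beam Python SDK. The project, bucket, table and transform logic are all hypothetical stand-ins for what a recipe would produce.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def clean_row(line):
    # Stand-in for recipe steps: split the line and trim whitespace
    fields = [f.strip() for f in line.split(",")]
    return {"name": fields[0], "phone": fields[1]}

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                # hypothetical project
    temp_location="gs://my-bucket/tmp",  # hypothetical bucket
    region="us-central1",
)

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/customers.csv")
     | "Clean" >> beam.Map(clean_row)
     | "Load" >> beam.io.WriteToBigQuery(
           "my-project:analytics.customers",
           schema="name:STRING,phone:STRING"))

With DataPrep, none of this code exists for you to maintain; the recipe is the source of truth, and the pipeline is regenerated from it.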
Looking at the graphic below, we can peek at what is happening behind the scenes when we run a job. Cloud DataPrep is performing the transformations in the recipe and scaling ‘workers’ up based on need. This is the fully managed part: I am not worrying about scaling any resources up or down.

If you are like me, you will be nervous the first time you see your job spin up 100 workers to accomplish what you thought was a simple task. I was thinking, ‘geesh, what is this going to cost?’ As it turns out, it is very cheap.
Cloud DataPrep will change pricing at the end of 2018, and we will use the proposed pricing for this document. The pricing will vary by region, but right now the service seems to be limited to the central US region (Iowa). The proposed price is $0.60 per Dataflow worker per hour.
For example, if the job above used 8 workers for an hour the cost would be:
1 hour * $0.60 * 8 workers = $4.80
You can monitor this cost and estimate it from the jobs menu.
Side note: from an architecture perspective, this type of pricing suggests you should do an initial batch load and then incremental changes from there, rather than what is often referred to as kill-and-fill.
Now that we have cleaned, transformed and loaded the data, let’s look at where it has landed. For a Google-centric overview, this video is a great start to get oriented.
BigQuery is one of the more exciting services in Google Cloud Platform. Parity is developing across the three primary public cloud providers in many areas, but this is one area where Google is years ahead of Amazon and Microsoft. My fear for BigQuery is that, because it is a successful service, Google is investing heavily in it. That sounds odd, but recently it has been taking on many new features, like inline machine learning. These features are interesting, but in my opinion the thing that makes BigQuery so compelling is that it is a fully managed data warehouse with a simple interface (SQL) that operates at any scale you want to throw at it. I hope that does not get lost in the features and functions.
Let’s look at some of the fundamentals of BigQuery.
BigQuery has a simple, web-based interface. You can look at public and private (your own) datasets and interact with them via standard SQL. See the interface below.
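The same standard SQL also runs programmatically. As a sketch, here is a query against one of Google's real public datasets via the BigQuery Python client; the credentials and billing project are assumed to be your own.

from google.cloud import bigquery

client = bigquery.Client()  # uses your default project and credentials

sql = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""

# Run the query and print the ten most common names in the dataset
for row in client.query(sql).result():
    print(row.name, row.total)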

One of the more innovative things about BigQuery is the pricing model. There are two levers for cost in BigQuery: storage, billed for the data you keep, and compute, billed for the bytes each query processes.
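Because queries are billed by bytes processed, a useful habit is a dry run, which estimates what a query would scan without actually running it. A minimal sketch with the Python client:

from google.cloud import bigquery

client = bigquery.Client()

# dry_run asks BigQuery to plan the query and report bytes scanned, at no cost
config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013`",
    job_config=config,
)
print(f"This query would process {job.total_bytes_processed / 1e9:.2f} GB")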
The first time I talked about BigQuery with a Google architect, I was surprised at the scale BigQuery can operate at. We were discussing a client moving an on-prem Hadoop infrastructure to the cloud. Without much experience on the topic, I suggested we move them to Cloud Dataproc. The Googler asked what the use case was and, after hearing it, said: this might sound strange, but BigQuery can easily handle that workload (hundreds of terabytes of data in this case) and will be cheaper and easier to manage. After discussing it more, I don’t know if there is a limit on the size of data you can throw at BigQuery, and I am now of the opinion that Hadoop/Dataproc are really for backwards compatibility and you should not be developing new things on them. I am sure someone who is a Hadoop expert, and not a BigQuery fanboy like me, can educate me otherwise.
Now we come to visualizing the data. You can use tools like Tableau or Power BI in this architecture; however, there are a few compelling things about Data Studio I want to highlight here. It is a data visualization tool, but I found it to be more than just that, with some attributes that make it worth considering before you drop money on a per-user tool from a competitor. Here are some of the service highlights.
Google Cloud does not highlight Data Studio in its services portfolio, but it is a powerful and cheap (FREE!) service. By comparison, similar tools like Tableau and Power BI charge based on user counts as well as user types. Data Studio is not as powerful as some of those tools, but it does create great-looking visualizations and is highly customizable… did I mention it is FREE!
One of the things I did not use extensively, but could appreciate as a power user, is the ability to take data exploration through a process with seamless integration. For example, if you are exploring data in BigQuery via SQL and want to look for insights in your query results, you can click EXPLORE IN DATA STUDIO.

This takes my results and exports them to the Explorer view in Data Studio. From there I can analyze my data to find insights and even create visualizations. Once I have something interesting, I can export it to a Data Studio report to share with others and include in other visualizations. It is all integrated and seamless from a user-experience standpoint: no copy and paste, no moving files around, just progressing through exploration of the data to a final product I can share with others. I have found myself doing this process with tools like Excel, where I might pull data from a database, save the results as a CSV, load that CSV into a spreadsheet, and create workbooks and formulas to do some analysis… then share my spreadsheet. Doing a similar process with Data Studio was liberating and refreshing compared to the minutiae I would have had to manage doing data exploration in Excel.
Data Studio does the basics well. Visualizations are clean and you can do just about anything with it. I pulled a few images of reports from the Data Studio Gallery to show how it can be used to visualize data.
To tell stories…

To interact with large data sets…

To create professional visualizations…

The visualizations are nice-looking, and I was able to develop my own rather quickly. There were some things that took a little getting used to. For example, Data Studio guesses at what you want to visualize when you drop a control (for example, a pie chart) onto your report. The guesses were pretty good, but I found myself changing them frequently. Despite being a web-based report designer, it is very powerful and customizable (as you can see from a few of the Gallery shots).
This web-based authoring becomes a strength when it comes time to deploy. I found it very easy to share and collaborate on reports with peers once I had a report developed. I could control what people did with the reports I developed and see their changes. This was seamless and refreshing compared to my experiences with tools where I had to deploy reports developed on my local machine in a disjointed process via the web.
When working with data you will inevitably need to combine data from disparate sources; for example, weather data with store operations. There are multiple ways to accomplish this. The ‘legacy’ way is to go to the data layer and ‘join’ the data, maybe in a table, a SQL statement, or a metadata layer provided by the visualization tool. Data Studio allows you to blend data in the visualization itself. You simply select the data sources you want and ask it to blend them; the result is a visualization that joins the data together. It is a simple, easy-to-use model for power users.
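For contrast, here is what the legacy approach might look like: the same weather-plus-store-operations combination expressed as a join at the data layer. The project, tables and columns are all hypothetical.

from google.cloud import bigquery

client = bigquery.Client()

# Join daily store sales to daily weather readings by location and date
sql = """
    SELECT s.store_id, s.sales_date, s.total_sales, w.avg_temp, w.precipitation
    FROM `my-project.retail.store_ops` AS s
    JOIN `my-project.retail.weather` AS w
      ON s.location = w.location AND s.sales_date = w.reading_date
"""
results = client.query(sql).result()

Blending in Data Studio produces a comparable result, but a power user can do it with a few clicks instead of owning SQL like this.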
To recap some of the powerful aspects of Data Studio: it is free, authoring and sharing are entirely web-based, it integrates seamlessly with BigQuery, and it can blend data from multiple sources right in the visualization.
Taking this up a level to our overall architecture, there are some pricing levers to be aware of. To recap what we discussed earlier: Cloud DataPrep charges per Dataflow worker-hour, BigQuery charges for storage and for the bytes each query processes, and Data Studio is free.
To illustrate cost with a small sample, I loaded 3M rows of data (1.3 GB) with some basic transformations on each row (using Cloud DataPrep), stored it (in BigQuery storage), and used Data Studio to query and visualize a report (consuming BigQuery compute). That process of transforming the data, storing it and running several queries on it cost me $0.60.
There are a few things driving that cost down in my example: the first 10 GB of BigQuery storage each month is free, and the first 1 TB of query data processed each month is free. Those two items are part of what Google calls their “always free” tier of services. So basically, for this small sample, BigQuery usage was free, and all I really paid for was DataPrep manipulating and loading my data into BigQuery.
A final note on cost: Data Studio is currently free and a very capable tool. This is a huge differentiator when comparing it to offerings like Tableau and Power BI. Both of those tools have different models and richer reporting capabilities; however, Data Studio is a solid visualization platform and, for the cost, blows the competition away!
This paper scratches the surface of what is possible with these powerful ‘serverless’ services. Having this basic architecture in place allows you to consider more advanced things, like using machine learning to predict outcomes, or using advanced services like natural language processing to enrich and structure data you might not normally think is accessible to your analytics capabilities (call logs, for example).
If you are evaluating business intelligence or analytics capabilities, the services from Google Cloud should be at the top of your list for consideration. Contact us today if you would like to dive deeper into your actual use cases and concerns. We would be happy to get you oriented and set up to consume Google Cloud services.