History Stream

The History Stream lets you drive any type of downstream consumption of backed up or archived data: you designate a set of objects, and GRAX continuously and automatically exports them to Parquet format in your object storage bucket.

Introduction

With History Stream you can take control of your historical data and do more with it throughout your enterprise. Whether you're looking to load data into a central warehouse or feed an analytics platform such as Tableau, AWS QuickSight, or Power BI, the History Stream facilitates easy data movement with minimal configuration and setup.

What's more, you get access to your entire historical data set, not just the latest version.

How It Works

Once enabled, the History Stream continuously monitors the GRAX Vault for new data. When it finds data and it's safe to write, it writes the data in Parquet format as detailed below. Every backed up version of a record for an enabled object will be included.

Data Structure

As is the case for backup data, the History Stream writes Parquet files to your configured object store, typically AWS S3 or Azure Blob Storage.

In modern object stores, object key naming schemes let us take advantage of the massive read/write scalability of these services. Furthermore, many of the data lake applications typically used to ingest and process History Stream data support common naming conventions that help them efficiently recognize and ingest new data. With these two considerations in mind, we structure History Stream data as follows:

parquet/org=1234/object=myObject/day=nn/hour=nn

  • parquet is the top-level directory you will see in your object storage bucket. All History Stream data goes here. You don't need to create this directory; it will be created automatically.
  • org represents the unique ID of your Salesforce Org.
  • object is the name of your enabled object. If you have multiple objects configured for History Stream, you will see a directory for each in here.
  • day is the day the data was written to the GRAX Vault. Depending on your backup job configuration, you may not have data for every day. You'll see a directory for each day where data was backed up.
  • hour is the hour the data was written to the GRAX Vault. If you have hourly backups enabled, you might see as many as 24 directories in here.

📘

Note: Times are in UTC
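
For illustration, a full object key under this scheme might look like parquet/org=1234/object=Contact/day=12/hour=20/data-00000.parquet (the object name and day/hour values here are examples). A minimal Python sketch that pulls the partition values back out of such a key:

# Parse the Hive-style partition values (org, object, day, hour) out of
# a History Stream object key. The example key below is illustrative.
def parse_partitions(key: str) -> dict:
    parts = {}
    for segment in key.split("/"):
        if "=" in segment:
            name, _, value = segment.partition("=")
            parts[name] = value
    return parts

example_key = "parquet/org=1234/object=Contact/day=12/hour=20/data-00000.parquet"
print(parse_partitions(example_key))
# {'org': '1234', 'object': 'Contact', 'day': '12', 'hour': '20'}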

Turning It On

For an object to be included in the History Stream, you have to explicitly turn it on. Any object that's supported for backup will work. While a simple UI for this is planned, the initial version doesn't have one yet.

📘

Contact your GRAX Customer Success Manager to enable History Stream. All you need to provide is a list of objects to be included.

Verify That It's Working

Once configured for one or more objects, the History Stream process will automatically start running every 5 minutes. Here's what to expect and how to verify that it's working:

  • After the first run, i.e. 5-10 minutes after the feature was turned on, you should see the parquet directory created in your object storage bucket. It may take some time for object data to show up; see below for an explanation of what to expect.
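
If you'd rather check from a script than the console, a minimal boto3 sketch like the following lists what has been written under the parquet prefix (the bucket name is a placeholder, and your AWS credentials must already be configured):

# List the first few History Stream keys to confirm the export is running.
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="my-grax-bucket", Prefix="parquet/", MaxKeys=20)
for obj in resp.get("Contents", []):
    print(obj["Key"])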

When To Expect Data

As mentioned, we partition data by hour. Once the History Stream has written data for a particular hour for a specific object, it sets an internal watermark to keep track of where to start from on the next run.

So, if we just wrote data for hour=20 on a given day, the next run is going to start looking for data that was written in hour=21. Thus, when we write data for hour=20, we have to make sure no additional data can come in for this hour; otherwise we might miss data on the next run. Currently, we accomplish this by writing data for a given hour 15 minutes after the hour has ended.

📘

How long does it take for Parquet data to appear?

The maximum latency for data to appear in the History Stream is 75 minutes.
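
That figure follows from the hourly partitioning and the 15-minute grace period described above: a record written at the very start of an hour waits up to 60 minutes for the hour to close, plus 15 minutes before its partition is flushed. A small sketch of the arithmetic (the timestamps are illustrative):

# Worst case: a record written at the top of an hour is not flushed until
# 15 minutes after that hour closes.
from datetime import datetime, timedelta, timezone

record_written = datetime(2021, 5, 10, 20, 0, tzinfo=timezone.utc)  # hour=20, minute 0
hour_closes = record_written.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
flush_time = hour_closes + timedelta(minutes=15)  # the 15-minute grace period

print(flush_time - record_written)  # 1:15:00, i.e. 75 minutes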

Conducting a Basic Data Check

Once you've got everything set up, it's a good idea to run a basic check that everything is as expected before configuring any downstream integrations. Here are a couple of ways to do that:

Downloading and Checking a Parquet file from S3

  1. Log in to the AWS Console
  2. Navigate to the bucket you have configured for GRAX
  3. Traverse the History Stream directory hierarchy, and locate an hourly partition with data in it. You'll see a file named something like data-00000.parquet
  4. From "Object Actions" menu, select "Download"

Once the file is downloaded to your computer, you can use a command-line utility such as parquet-tools to print its contents:

% parquet-tools cat data-00000.parquet
Title = Senior Editor
LastName = Iiannone
Id = 0035e000001uC6AAAU
SystemModstamp = 2021-04-30T20:14:14.000Z
FirstName = Jobyna
IsDeleted = false
OwnerId = 0055e000001D8WEAA0
PhotoUrl = /services/images/photo/0035e000001uC6AAAU
LastModifiedDate = 2021-04-30T20:14:14.000Z
IsEmailBounced = false
Name = Jobyna Iiannone
AccountId = 0015e0000043cIcAAI
CleanStatus = Pending
CreatedDate = 2021-04-30T20:14:14.000Z
LastModifiedById = 0055e000001D8WEAA0
CreatedById = 0055e000001D8WEAA0
Email = [email protected]
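
If you'd rather inspect the file in Python, pandas can read it directly (this assumes pandas and a Parquet engine such as pyarrow are installed):

# Read the downloaded Parquet file and show a few records.
import pandas as pd

df = pd.read_parquet("data-00000.parquet")
print(df.shape)   # (rows, columns)
print(df.head())  # first few records, including the grax__ metadata fields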

Using S3 Select

S3 Select is an AWS S3 feature that allows you to issue SQL queries directly against objects stored in S3. Parquet is one of the formats supported by this feature. Here's how you use it from the console (a scripted equivalent follows the steps below):

  1. Log in to the AWS Console and navigate to S3
  2. Locate an hourly partition with data in it
  3. In the "Object Actions" menu, select "Query with S3 Select"
  4. On the next screen, make sure the input format is "Apache Parquet" and set output to "CSV"
  5. Run the default query to return the first 5 records from the file.
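
The same query can also be run from a script with boto3's select_object_content call. A minimal sketch, with placeholder bucket and key names:

# Run the default S3 Select query against one History Stream file.
import boto3

s3 = boto3.client("s3")
resp = s3.select_object_content(
    Bucket="my-grax-bucket",
    Key="parquet/org=1234/object=Contact/day=12/hour=20/data-00000.parquet",
    ExpressionType="SQL",
    Expression="SELECT * FROM s3object s LIMIT 5",
    InputSerialization={"Parquet": {}},
    OutputSerialization={"CSV": {}},
)
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))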

Useful to Know

Limits

📘

The History Stream currently supports up to 100 enabled objects. If you need additional objects enabled, please contact your GRAX Account Manager.

The GRAX Parquet Format

History Stream data is written in Apache Parquet format. One of the advantages of Parquet over other interchange formats like CSV is that the schema is encoded along with the object data. It also supports excellent compression. Parquet is well supported by all major ETL, data storage, and data processing platforms.

Salesforce Data Types vs Parquet Data Types

In version 1.0 of the History Stream, all Salesforce field data types are treated as strings in Parquet. We do this to allow the smoothest integration with ETL tooling like AWS Glue, knowing that most of our customers will want to stage schemas to their preference once the data lands in their data lake or analytics warehouse.
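
If you do need typed columns before the data reaches your warehouse, re-typing is straightforward in most tools. A minimal pandas sketch (the column names follow the Contact example above, and the lowercase true/false string representation is an assumption):

# Re-type a few string columns after loading a History Stream file.
import pandas as pd

df = pd.read_parquet("data-00000.parquet")
df["CreatedDate"] = pd.to_datetime(df["CreatedDate"], utc=True)
df["LastModifiedDate"] = pd.to_datetime(df["LastModifiedDate"], utc=True)
df["IsDeleted"] = df["IsDeleted"] == "true"  # assumes booleans arrive as "true"/"false" strings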

GRAX Metadata Fields

In addition to the Salesforce object data embedded in the Parquet files, GRAX adds the following metadata fields prefixed by grax__:

  • grax__added Parquet timestamp indicating when the record was written to the GRAX Vault.
  • grax__deleted Parquet timestamp indicating when the record was deleted from Salesforce.
  • grax__deletesource String indicating whether the record was deleted by grax or salesforce.

Dates are stored in Unix Timestamp format in milliseconds. The integer is the number of milliseconds since January 1st 1970 00:00:00 UTC.

📘

Converting timestamps to readable dates

You may need to do some conversion to turn UNIX timestamps into easily readable dates. Fortunately, this is quite simple to do. For example, in AWS Athena you can use the following conversion function to display the grax__added metadata field as a readable date:

select from_unixtime(grax__added / 1000) AS readableDate from mytable

This would yield dates in a format like this:

2021-05-10 22:23:52.000

Note that this is still in UTC.
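
The equivalent conversion in Python, assuming grax__added holds milliseconds since the Unix epoch (the example value corresponds to the sample date above):

# Convert a grax__added value (milliseconds since the epoch) to a UTC datetime.
from datetime import datetime, timezone

grax_added = 1620685432000  # example value in milliseconds
print(datetime.fromtimestamp(grax_added / 1000, tz=timezone.utc))
# 2021-05-10 22:23:52+00:00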

Managing Duplicates

A fundamental principle of the History Stream is that you get all the versions of all records that we have stored for enabled objects. Ultimately, this is what enables you to accurately analyze how your data evolves over time. Because of this, you will see some amount of duplicate records in the Parquet extracts, and your reports and queries will need to account for duplicates accordingly (a small deduplication sketch appears at the end of this section).

For example, the following Salesforce conditions will cause a new version to be captured by GRAX while the LastModifiedDate and SystemModstamp fields remain unchanged between versions:

  • Formula Fields: If you are backing up formula fields for a given object, and only the formula field value changed, GRAX will capture a new version but the SystemModstamp/LastModifiedDate audit date fields will not change.
  • Schema Changes: When the schema changes (including changing default field values), GRAX will capture a new version but the SystemModstamp/LastModifiedDate audit date fields will not change. This is because each version of a record is uniquely tied to its schema at the time of capture.

Additionally, here are some examples of GRAX behaviors that can also produce duplicate records, at least at the Salesforce data level:

  • Full Backups: Running on-demand or periodic full backups for an object will capture a new version for each included record, regardless of whether we already have a version in storage with the same LastModifiedDate or SystemModstamp.
  • Archiving: As the first step of an archive process, GRAX captures a full backup of the current state of each included record. Even if the record did not change, this will result in a duplicate version.
  • Delete Tracking: GRAX performs periodic delete tracking. When we detect a deleted record, we add GRAX metadata to capture the time of the delete. Because this is purely GRAX metadata, the SystemModstamp/LastModifiedDate fields will not change, but you will get a new version with a timestamp value in grax__deleted:

Figure: managing duplicate record data, delete tracking.
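
If a downstream report only needs the latest version of each record, one simple approach is to keep the row with the greatest grax__added value per Salesforce Id. A minimal pandas sketch (assuming grax__added is the milliseconds-since-epoch capture time described above; in practice you would load all partitions for the object, not a single file):

# Keep only the most recently captured version of each record.
import pandas as pd

df = pd.read_parquet("data-00000.parquet")
latest = df.sort_values("grax__added").drop_duplicates(subset="Id", keep="last")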

Working with History Stream Data

Now that History Stream is turned on, you're ready to make use of your data in other systems. We've architected this feature to support any type of data lake or analytics platform with minimal effort. In the GRAX Training Library you'll find screencasts showing you how to integrate with some of the most popular targets for History Stream data:

  • Crawling History Stream data with AWS Glue
  • Converting historical data to CSV with AWS Glue
  • Loading data into AWS Redshift
  • SQL queries on the History Stream with AWS Athena
  • Reporting on archived data with Tableau CRM
  • Visualizing historical data in Tableau Desktop/Online
  • Analyzing historical data with AWS QuickSight
  • Loading historical data into Microsoft Azure Synapse Analytics