AWS Redshift Spectrum

Image: colorful lights over a subway sign, by Quijano Flores

VSCO uses Amazon Redshift Spectrum with AWS Glue Catalog to query data in S3. This piece describes the steps we took to adopt Redshift Spectrum for our primary use case (behavioral events data), lists subsequent use cases, and closes with tips we have learned along the way. The first use case focuses on how we updated the transformation and loading of behavioral events for analytics.

In late 2018, VSCO's behavioral event data was growing quickly. Increased user engagement across VSCO's mobile and web platforms was great news, but we wanted to limit storage costs for the rising influx of event data. Let's first overview the data's upstream origins before describing the transform and load steps of the behavioral events data pipeline.

When people interact with VSCO on our web or mobile platforms, their actions trigger the platforms to record event messages. For example, editing a photo results in an event message describing which preset was used. Each event message is sent to an API that publishes it to a Kafka topic. A consumer reads batches of event messages from the topic and writes them as Apache Parquet files to AWS S3. We call these unprocessed Parquet files "alpha" data.

Diagram: Behavioral event origins and upstream pipeline
1. Phones and the VSCO website send behavioral event messages to a Go service.
2. The Go service sends the messages to Kafka topics matching the event types.
3. Spark consumers read from the Kafka topics and write the data to S3 at example prefixes alpha/event=A and alpha/event=B.

This part of the behavioral events data pipeline continued to serve our needs, so it was not changed as part of the storage cost work. The raw "alpha" data remained the source data for both versions of the downstream pipeline.

Our old pipeline read raw "alpha" data from one AWS S3 location and wrote processed "beta" JSON files to another. The processing step deduplicated events, removed unsupported characters, updated a few field types, and flattened nested fields. From the "beta" S3 location, the JSON data was copied into Amazon Redshift tables. Amazon Redshift served as our analytics database, where Data Scientists and Analysts used the event tables for ad hoc exploration and for building downstream tables. (Minimal code sketches of these pipeline steps follow at the end of this piece.)

Diagram: Transform & load to Redshift pipeline
This diagram depicts the transform and load steps for an example behavioral event type "A".
1. A Spark job reads Parquet files from an S3 prefix alpha/event=A.
2. It processes the data and writes it as JSON files to the S3 prefix beta/event=A.
3. The JSON files are then inserted into a Redshift table.
4. Sometimes, aggregate data from 'table_a' is stored in a Redshift table 'agg_table_a'.

Our Amazon Redshift cluster comprises DC2 nodes, which store data locally for fast performance. Storing all the behavioral events data in the Redshift cluster enabled speedy query responses, at the cost of adding new nodes whenever disk space ran low.

We were concerned to see disk space usage growing more quickly than our compute needs, especially after we found that most behavioral events tables were infrequently queried. Many had query logs showing only a few ad hoc investigations spread over many weeks. Some were used to update downstream tables that tracked company metrics or experiment results at daily or hourly intervals. We saw that we should preserve query access to the data, but we were paying too high a cost for its storage.
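The piece does not include the upstream consumer's source, but a minimal sketch, assuming Spark Structured Streaming and a hypothetical topic name, bucket, and event schema, could look like the following. Partitioning by the event field is what produces prefixes like alpha/event=A.

```python
# Hypothetical sketch of the "alpha" consumer: read behavioral event
# messages from Kafka and write them to S3 as Parquet, partitioned by
# event type so files land under prefixes like alpha/event=A.
# The bucket, topic, and schema below are illustrative, not VSCO's.
# Requires the spark-sql-kafka connector package on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("alpha-consumer").getOrCreate()

# Example schema for the JSON event payload.
event_schema = StructType([
    StructField("event", StringType()),          # event type, e.g. "A" or "B"
    StructField("user_id", StringType()),
    StructField("occurred_at", TimestampType()),
])

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "behavioral-events")    # hypothetical topic name
    .load()
    # Kafka message values arrive as bytes; parse the JSON payload.
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

(
    events.writeStream
    .format("parquet")
    .option("path", "s3://example-bucket/alpha/")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/alpha/")
    .partitionBy("event")                        # writes event=A/, event=B/ directories
    .start()
    .awaitTermination()
)
```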
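The alpha-to-beta transform described above (deduplication, unsupported-character removal, field type updates, flattening of nested fields) might look roughly like this PySpark sketch; the column names and the specific cleanup rules are assumptions for illustration.

```python
# Hypothetical sketch of the alpha -> beta transform for event type "A":
# deduplicate, strip unsupported characters, update a field type, flatten
# a nested struct, and write JSON. Column names and rules are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("beta-transform-event-A").getOrCreate()

alpha = spark.read.parquet("s3://example-bucket/alpha/event=A/")

beta = (
    alpha
    .dropDuplicates(["event_id"])                             # deduplicate events
    .withColumn("user_id", F.col("user_id").cast("string"))   # update a field type
    # Remove non-ASCII characters the loader cannot ingest (example rule).
    .withColumn("note", F.regexp_replace("note", r"[^\x00-\x7F]", ""))
    # Flatten a nested struct field into top-level columns.
    .withColumn("device_os", F.col("device.os"))
    .withColumn("device_model", F.col("device.model"))
    .drop("device")
)

beta.write.mode("overwrite").json("s3://example-bucket/beta/event=A/")
```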
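The load step copies the "beta" JSON into Redshift with the COPY command. A minimal sketch, assuming psycopg2 as the client (Redshift speaks the PostgreSQL wire protocol); the table name, cluster endpoint, credentials, and IAM role ARN are placeholders.

```python
# Hypothetical sketch of the load step: COPY the "beta" JSON files for
# event type "A" from S3 into a Redshift table. The table name, cluster
# endpoint, credentials, and IAM role ARN are placeholders.
import psycopg2

copy_sql = """
    COPY analytics.event_a
    FROM 's3://example-bucket/beta/event=A/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-load'
    FORMAT AS JSON 'auto';
"""

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="loader",
    password="example-password",
)
cur = conn.cursor()
cur.execute(copy_sql)   # Redshift ingests every JSON file under the prefix
conn.commit()
cur.close()
conn.close()
```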
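Finally, the destination this piece is heading toward: with Redshift Spectrum, the data stays in S3 and is queried through an external schema backed by the AWS Glue Data Catalog. A minimal sketch under the same assumptions; the schema name, Glue database, and IAM role ARN are placeholders.

```python
# Hypothetical sketch of querying S3 data via Redshift Spectrum: register
# an external schema backed by the AWS Glue Data Catalog, then query the
# S3-resident table like any other. Names and the role ARN are placeholders.
import psycopg2

create_schema_sql = """
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_events
    FROM DATA CATALOG
    DATABASE 'events'
    IAM_ROLE 'arn:aws:iam::123456789012:role/example-spectrum'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;
"""

query_sql = """
    SELECT user_id, COUNT(*) AS edits
    FROM spectrum_events.event_a   -- resolved via Glue; data lives in S3
    WHERE occurred_at >= '2019-01-01'
    GROUP BY user_id
    ORDER BY edits DESC
    LIMIT 10;
"""

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="analyst",
    password="example-password",
)
conn.autocommit = True   # run DDL outside an explicit transaction
cur = conn.cursor()
cur.execute(create_schema_sql)
cur.execute(query_sql)
print(cur.fetchall())
cur.close()
conn.close()
```

A layout note: keeping the S3 prefixes partitioned by event type (event=A, event=B) maps naturally onto Glue partition keys, so Spectrum can prune partitions and scan only the files a query needs.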