Tailor-Made AWS Lambda for Data Stream Processing

Getting Started

Here is a link to an example IaC template for creating an ingestion Tap. You can validate and deploy templates with bdcli.

Please use this “Getting Started with Data Taps” GitHub repository.

Also, please join our Discord chat, where we can help you: https://discord.gg/Zq5t3KjndG
(You can also send Dan a private message if you don’t want to discuss in public; he will be glad to guide you and discuss your use cases.)

The “Getting Started with Data Taps” repository has targets for all phases and a step-by-step guide to get started.

You can also check this Web Analytics example GitHub repository.

NOTE! Data Taps is deployed to the following AWS Regions: eu-west-1, eu-north-1, us-west-2, us-east-2.

When you deploy, you will get a URL as a response, to which you can send your data.

The repository has targets for fetching an authorization token, getting the Tap URL from the deployment output, and sending test data to the Tap.
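For orientation, sending data to a Tap from your own code could look roughly like the following Python sketch. It is an illustration only: the header name `x-bd-authorization`, the newline-delimited JSON body, and the content type are assumptions made here, and `build_tap_request` is a hypothetical helper. Check the repository targets for the exact request format.

```python
import json
import urllib.request


def build_tap_request(
    tap_url: str,
    token: str,
    records: list,
    auth_header: str = "x-bd-authorization",  # assumption: header name may differ
) -> urllib.request.Request:
    """Build a POST request that carries the records as newline-delimited JSON."""
    # One JSON document per line (assumed body format).
    body = "\n".join(json.dumps(record) for record in records).encode("utf-8")
    return urllib.request.Request(
        tap_url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/x-ndjson",  # assumed content type
            auth_header: token,
        },
    )


# Usage (placeholders, not real values):
# request = build_tap_request("https://<your-tap-url>/", "<your-token>", [{"event": "click"}])
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Building the request separately from sending it makes it easy to inspect the body and headers before any data leaves your machine.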

The URLs are AWS Lambda Function URLs, and the Data Taps free tier sets the Lambda concurrency limit to 2. This means that at most two Lambda functions ingest your data at any one time, which is enough for most use cases: Lambda starts running only after the AWS Lambda service has fully received the data, and it then processes the data very quickly. Paid tiers support higher concurrency and higher data volumes.

In fact, AWS Lambda with Function URLs is practically made for stream data processing, as it minimises Lambda function runtime and caches the data in memory or on the Lambda local disk. Data loss is prevented by a Lambda extension that flushes any remaining data when the Lambda container shuts down.
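The buffer-then-flush idea can be sketched in a few lines of Python. This is not the Data Taps implementation: `BufferedSink` and `flush_fn` are hypothetical names, and `atexit` merely stands in for the shutdown hook that a Lambda extension provides in the real service.

```python
import atexit


class BufferedSink:
    """Collect records in memory and hand them off in batches."""

    def __init__(self, flush_fn, max_records=1000):
        self._flush_fn = flush_fn      # called with a list of buffered records
        self._max_records = max_records
        self._buffer = []
        # Stand-in for the shutdown hook: flush leftovers when the process exits.
        atexit.register(self.flush)

    def add(self, record):
        """Buffer one record; flush automatically once the batch is full."""
        self._buffer.append(record)
        if len(self._buffer) >= self._max_records:
            self.flush()

    def flush(self):
        """Hand off whatever is buffered; do nothing if the buffer is empty."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._flush_fn(batch)
```

The final flush is what protects the tail of the stream: without it, records that never filled a complete batch would be lost at shutdown.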

You don’t have to worry about fluctuating ingestion traffic or high peaks, as the AWS Lambda service scales fast and independently for each Data Tap you have, up to a thousand concurrent functions. There is no need to (auto)scale clusters horizontally or vertically; instead, AWS Lambda responds rapidly to your ingestion traffic load.

In practice, it is hard to scale to even tens of concurrent AWS Lambda instances, because messages are processed fast and network bandwidth limits how quickly you can push data: by the time you send the next batch, Lambda has long since processed the previous one. Furthermore, the incoming JSON data, once converted to ZSTD-compressed Parquet, is a fraction of its original size, which saves you money on S3 costs. You save even more if you use SQL to aggregate the traffic into metric values and discard the raw data.
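The concurrency point above can be made concrete with a rough estimate: the number of functions busy at once is approximately the request rate multiplied by the time each request takes (Little's law). The sketch below is a back-of-the-envelope calculation, not a measurement; `required_concurrency` is a hypothetical helper and the figures in the usage note are illustrative assumptions.

```python
import math


def required_concurrency(requests_per_second: float, millis_per_request: float) -> int:
    """Estimate how many Lambda instances are busy at once (Little's law)."""
    # Average number of in-flight requests = arrival rate x time per request.
    in_flight = requests_per_second * millis_per_request / 1000
    # At least one instance is needed, and partial instances round up.
    return max(1, math.ceil(in_flight))
```

For example, assuming 100 batches per second that each take 15 ms to process, `required_concurrency(100, 15)` returns 2. Under the same assumed processing time, you would need to push well over 600 batches per second before ten instances were busy at once.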