I have been working with Amazon Kinesis Data Firehose for over a year, and I'm pretty happy with the way it works. In a nutshell, it's a service that writes data to Amazon S3 with custom data transformation and buffering rules. My current use case is simple -- write events into Amazon S3 for further processing with Apache Spark.
Unfortunately, the more events I have, the more small files land on S3, and the slower Spark processing becomes.
That is where Fireblender comes in. The idea of the project is simple -- given a time range, join all data files in that range into bigger chunks to allow faster processing, and return a URL to the result.
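To make the "given a time range" part concrete, here is a minimal sketch of how a time range could be turned into S3 key prefixes. It assumes the delivery stream uses Firehose's default UTC `YYYY/MM/dd/HH/` prefix layout. The function name `hourly_prefixes` and the `base` parameter are hypothetical and not part of Fireblender.

```python
from datetime import datetime, timedelta, timezone


def hourly_prefixes(start, end, base=""):
    """Return the YYYY/MM/dd/HH/ prefixes covering [start, end).

    Both arguments must be timezone-aware datetimes. `base` is an
    optional custom prefix (hypothetical) placed before the date part.
    """
    # Assumed layout: one "folder" per UTC hour, so work in UTC and
    # round the start down to the beginning of its hour.
    current = start.astimezone(timezone.utc).replace(
        minute=0, second=0, microsecond=0
    )
    end = end.astimezone(timezone.utc)

    prefixes = []
    while current < end:
        prefixes.append(base + current.strftime("%Y/%m/%d/%H/"))
        current += timedelta(hours=1)
    return prefixes
```

Each returned prefix can then be used to list the small objects that belong to that hour before joining them.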
I have started this project with a sample data generator called fireblender-datagen. It will be extended with additional data sets and generation strategies.
For the main part of Fireblender, I want to start simple: focus on binary file concatenation and run all the code in AWS Lambda. The first step is choosing the underlying technology -- I have narrowed the candidates to Python, C#, and Go. I'm going to test raw file processing performance first and then repeat the same operations with S3 integration.
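As a rough illustration of what the raw file test could look like in Python, here is a sketch of byte-for-byte concatenation of local files with a simple timer around it. The names `concat_files` and `timed_concat` and the 1 MiB buffer size are assumptions made for this example, not the final Fireblender code.

```python
import shutil
import time


def concat_files(sources, target, buffer_size=1024 * 1024):
    """Append every source file to target byte for byte.

    Returns the number of bytes written to target.
    """
    with open(target, "wb") as out:
        for source in sources:
            with open(source, "rb") as src:
                # Stream in fixed-size chunks so memory use stays flat
                # no matter how large the inputs are.
                shutil.copyfileobj(src, out, buffer_size)
        return out.tell()


def timed_concat(sources, target):
    """Run concat_files and return (bytes_written, elapsed_seconds)."""
    started = time.perf_counter()
    size = concat_files(sources, target)
    return size, time.perf_counter() - started
```

Plain concatenation like this only makes sense for formats that stay valid when files are glued together, such as newline-delimited records.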