Kinesis is a real-time data processing service that captures and stores large amounts of data to power dashboards and analytics.

In the past, real-time processing of massive amounts of data was hard… very hard, in fact. Kinesis is quite useful when you need to do multi-stage processing of data: partition the data, then load it. There are tons of applications in real-time gaming, IoT and mobile app analytics, and real-time monitoring of apps and system logs.

And it is not just one app at a time… there can be multiple incoming data streams working concurrently. It is also durable, with data written to three AZs, yet not long-lived: data is retained for 24 hours by default and for up to 7 days.

There are three components to the Kinesis products:

  • Firehose - a data loader which can batch, compress, and encrypt data before loading it into S3, Redshift, Elasticsearch or Kinesis Analytics

  • Analytics - the ability to analyze a stream using SQL in an interactive tool including a SQL editor and templates

  • Streams - the underlying stream of records, made up of shards, that producers write to and that consumers such as Firehose read from

As a complete side note, Kinesis is the best marketed product in the AWS line.

Components

Data producers

Data producers add records to the stream using:

  • Kinesis Streams API via the PutRecord or PutRecords calls

  • Kinesis Producer Library - for developing reusable producers

  • Kinesis Agent - a Java application for Linux devices
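
As a rough sketch of the first option, the snippet below batches events into PutRecords-shaped requests. It assumes the boto3 Kinesis client's put_records(StreamName=..., Records=[...]) call. The helper names, the "user_id" field, and the stream name "example-stream" are hypothetical. The client is passed in as an argument so the batching logic can be tried without AWS credentials.

```python
import json

MAX_RECORDS_PER_CALL = 500  # PutRecords accepts at most 500 records per request


def build_entries(events, key_field="user_id"):
    """Turn dict events into PutRecords entries (Data blob + PartitionKey)."""
    return [
        {
            "Data": json.dumps(event).encode("utf-8"),
            # Hypothetical choice: use one field of the event as the partition key.
            "PartitionKey": str(event[key_field]),
        }
        for event in events
    ]


def send_events(client, stream_name, events):
    """Send events in batches of up to 500; return the count of failed records."""
    entries = build_entries(events)
    failed = 0
    for start in range(0, len(entries), MAX_RECORDS_PER_CALL):
        batch = entries[start:start + MAX_RECORDS_PER_CALL]
        response = client.put_records(StreamName=stream_name, Records=batch)
        failed += response.get("FailedRecordCount", 0)
    return failed
```

With boto3 installed and credentials configured, this could be called as send_events(boto3.client("kinesis"), "example-stream", events). A real producer would also want to retry the records reported as failed, which this sketch only counts.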

Records

Records include a sequence number, a partition key, and a data blob of up to 1 MB. Partition keys are assigned to records by the producer before they are written to the stream, while sequence numbers are generated by Kinesis after the client writes the record and are unrelated to the partition key.

Streams & Shards

A stream is made up of shards, and each shard provides 1 MB per second of write capacity and 2 MB per second of read capacity. There are tons of connectors, libraries, and tools available.

The max size of a data blob is 1 MB.

You can calculate the initial number of shards (number_of_shards) that your stream will need by using the input values in the following formula:

number_of_shards = max(incoming_write_bandwidth_in_KB/1000, outgoing_read_bandwidth_in_KB/2000)
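
The formula above can be worked through in a few lines of Python. This is a minimal sketch: the function name and the example bandwidth figures are made up for illustration, and rounding up to a whole shard (with a floor of one shard) is an assumption added here, since the formula itself can yield a fraction.

```python
import math


def number_of_shards(incoming_write_kb_per_sec, outgoing_read_kb_per_sec):
    """Estimate the initial shard count from write and read bandwidth in KB/s."""
    write_shards = incoming_write_kb_per_sec / 1000  # 1 MB/s write per shard
    read_shards = outgoing_read_kb_per_sec / 2000    # 2 MB/s read per shard
    # Assumption: round up, because a stream cannot have a fractional shard.
    return max(1, math.ceil(max(write_shards, read_shards)))


# 5,000 KB/s in needs 5 shards; 12,000 KB/s out needs 6 shards; take the larger.
print(number_of_shards(5000, 12000))  # prints 6
```

Here the read side drives the result: the writes alone would fit in 5 shards, but serving 12,000 KB per second of reads needs 6.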

The number of partition keys should typically be much greater than the number of shards. This is because the partition key is used to determine how to map a data record to a particular shard. If you have enough partition keys, the data can be evenly distributed across the shards in a stream.
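
To see why many partition keys help, the sketch below simulates the mapping. Kinesis hashes each partition key with MD5 into a 128-bit integer and assigns the record to the shard whose hash key range contains that value. The assumption here is that the shards split the hash space into equal ranges, which is not guaranteed for a real stream; the function names are hypothetical.

```python
import hashlib
from collections import Counter

HASH_SPACE = 2 ** 128  # MD5 digests are 128-bit integers


def shard_for_key(partition_key, shard_count):
    """Map a partition key to a shard index, assuming equal hash key ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    return hash_key * shard_count // HASH_SPACE


def distribution(keys, shard_count):
    """Count how many records land on each shard for the given keys."""
    return Counter(shard_for_key(key, shard_count) for key in keys)


# Many distinct keys spread across all shards; one repeated key hits one shard.
many_keys = distribution([f"device-{i}" for i in range(10000)], 4)
one_key = distribution(["device-0"] * 10000, 4)
print(sorted(many_keys.items()))
print(sorted(one_key.items()))
```

With 10,000 distinct keys, each of the four simulated shards receives roughly a quarter of the records, while a single repeated key sends everything to one shard and leaves the rest idle.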