When I speak to people who are thinking about implementing BI, they are often overwhelmed by all the things they could measure. Many start by wanting to measure everything, which doesn’t necessarily help them. That’s because there’s an inherent cost in measuring things – everything you report and track creates an ongoing burden that your organization has to maintain. That’s why it’s important to be selective about what you measure from the get-go.
In our last blog post, we talked about developing data processing jobs using Apache Beam. This time we are going to talk about one of the most in-demand capabilities in the Big Data world today: processing streaming data. The principal difference between batch and streaming is the type of input data source. When your data set is bounded (even if it’s huge in size) and is not being updated while it is processed, you would likely use a batch pipeline.
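To make the distinction concrete, here is a minimal sketch in plain Python. It deliberately does not use the Apache Beam API, and the function names and the count-based grouping are assumptions made for illustration (real streaming engines usually group elements by time rather than by count). The point is only that a bounded input can be consumed in full to produce one final result, while an unbounded input has to produce partial results as data arrives.

```python
from collections import Counter
from itertools import cycle, islice
from typing import Iterable, Iterator


def batch_word_count(lines: Iterable[str]) -> Counter:
    """Bounded input: read everything, then return one final result."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def streaming_word_count(lines: Iterable[str], group_size: int) -> Iterator[Counter]:
    """Possibly unbounded input: yield one partial result per group of elements.

    Grouping by element count is a simplification chosen for this sketch.
    """
    source = iter(lines)
    while True:
        group = list(islice(source, group_size))
        if not group:
            return  # only reached if the source turns out to be finite
        counts: Counter = Counter()
        for line in group:
            counts.update(line.split())
        yield counts


# Batch: the data set is fixed, so a single pass gives the complete answer.
batch_result = batch_word_count(["a b", "a"])

# Streaming: cycle() simulates a source that never ends, so we can only
# ever look at the partial results produced so far (here, the first three).
endless_source = cycle(["a b", "a"])
first_three = list(islice(streaming_word_count(endless_source, group_size=2), 3))
```

Note that calling `batch_word_count` on the endless source would never return, which is exactly why an unbounded source calls for a different kind of pipeline.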