Systems | Development | Analytics | API | Testing

What is new in Cloudera Streaming Analytics 1.5?

At the end of May, we released the second version of Cloudera SQL Stream Builder (SSB) as part of Cloudera Streaming Analytics (CSA). Among other features, the 1.4 version of CSA surfaced the expressivity of Flink SQL in SQL Stream Builder via adding DDL and Catalog support, and it greatly improved the integration with other Cloudera Data Platform components, for example via enabling stream enrichment from Hive and Kudu.

Accelerate Your Data Mesh in the Cloud with Cloudera Data Engineering and Modak Nabu

Modak, a leading provider of modern data engineering solutions, is now a certified solution partner with Cloudera. Customers can seamlessly automate migration to Cloudera’s cloud-based enterprise platform CDP from on-prem deployments and dynamically auto-scale cloud services with Cloudera Data Engineering (CDE)’s integration with Modak Nabu™.

Admission Control Architecture for Cloudera Data Platform

Apache Impala is a massively parallel in-memory SQL engine supported by Cloudera designed for Analytics and ad hoc queries against data stored in Apache Hive, Apache HBase and Apache Kudu tables. Supporting powerful queries and high levels of concurrency Impala can use significant amounts of cluster resources. In multi-tenant environments this can inadvertently impact adjacent services such as YARN, HBase, and even HDFS.

Processing DICOM Files With Spark on CDP Hybrid Cloud

In this video, you will see how you can use PySpark to process medical images from an MRI and convert them from DICOM format to PNG. The data is read from and written to AWS S3 and we leverage numpy and the pydicom libraries to do the data transformation. We are using data from the "RSNA-MICCAI Brain Tumor Radiogenomic Classification" Kaggle competition but this approach can be used for general purpose DICOM processing.

How Cloudera DataFlow Enables Successful Data Mesh Architectures

In this blog, I will demonstrate the value of Cloudera DataFlow (CDF), the edge-to-cloud streaming data platform available on the Cloudera Data Platform (CDP), as a Data integration and Democratization fabric. Within the context of a data mesh architecture, I will present industry settings / use cases where the particular architecture is relevant and highlight the business value that it delivers against business and technology areas.

Struggling to Manage your Multi-Tenant Environments? Use Chargeback!

If your organization is using multi-tenant big data clusters (and everyone should be), do you know the usage and cost efficiency of resources in the cluster by tenants? A chargeback or showback model allows IT to determine costs and resource usage by the actual analytic users in the multi-tenant cluster, instead of attributing those to the platform (“overhead’) or IT department. This allows you to know the individual costs per tenant and set limits in order to control overall costs.

An Introduction to Ranger RMS

Cloudera Data Platform (CDP) supports access controls on tables and columns, as well as on files and directories via Apache Ranger since its first release. It is common to have different workloads using the same data – some require authorizations at the table level (Apache Hive queries) and others at the underlying files (Apache Spark jobs). Unfortunately, in such instances you would have to create and maintain separate Ranger policies for both Hive and HDFS, that correspond to each other.

Migrate to CDP Private Cloud Base - A Step by Step Guide

Our recent blog discussed the four paths to get from legacy platforms to CDP Private Cloud Base. In this blog and accompanying video, we will deep dive into the mechanics of running an in-place upgrade from CDH5 or CDH6 to CDP Private Cloud Base. The overall upgrade follows a seven-step process illustrated below. In the video below we walk through a complete end to end upgrade of CDH to CDP Private Cloud Base.

Future of Data Meetup: CDP on Azure - Industrial Strength Data Engineering

Data Engineering is undergoing a huge evolution requiring faster and more reliable data pipelines. Apache Spark and Python are core foundational components of this new architecture enabling data engineers to quickly develop these pipelines. They also introduce challenges when moving to production. Come join us as we: Ask questions and learn. We will also have a raffle of Cloudera swag.

Serving the Public Through Data

Digital transformation has been talked about for many years, but the pandemic has accelerated the digital transformation journeys for many enterprises. Forced to adapt to changes in the business landscape and customer behavior, businesses have adopted more digital tools and technologies to drive innovation and increase resilience.