Cloudera

The Art of Data Leadership | A discussion with Chief Digital Officer, Ray Kunik

Our Chief Data & Analytics Officer, Shayde Christian, sits down for a buzzworthy conversation with Chief Digital Officer Raymond L. Kunik Jr. to discuss the “other” CDO role, the science behind work-life integration, the impact and applications of AI, and its correlation with a pretty sweet hobby.

Why Reinvent the Wheel? The Challenges of DIY Open Source Analytics Platforms

To reduce their technology spend, some organizations that rely on open source projects for advanced analytics consider either building and maintaining their own runtime with the required data processing engines or retaining older, now obsolete versions of legacy Cloudera runtimes (CDH or HDP).

Boosting Object Storage Performance with Ozone Manager

Ozone is an Apache Software Foundation project to build a distributed storage platform that caters to the demanding performance needs of analytical workloads, content distribution, and object storage use cases. The Ozone Manager is a critical component of Ozone. It is a replicated, highly available service that is responsible for managing the metadata for all objects stored in Ozone. As Ozone scales to exabytes of data, it is important to ensure that Ozone Manager can perform at scale.

Applied Machine Learning Prototypes | The Future of Machine Learning

Applied Machine Learning Prototypes, or AMPs, are pre-built applications that can be used as a starting point for your next machine learning project. These prototypes are designed to save time and resources by providing a tested and reliable solution to common machine learning problems. Cloudera + Dell + AMD.

Unlock the Full Potential of Hive

In the realm of big data analytics, Hive has been a trusted companion for summarizing, querying, and analyzing huge and disparate datasets. But let’s face it, navigating the world of any SQL engine is a daunting task, and Hive is no exception. As a Hive user, you will find yourself wanting to go beyond surface-level analysis and dive deep into the intricacies of how a Hive query is executed.

One Big Cluster Stuck: Environment Health Scorecard

Throughout the One Big Cluster Stuck series we’ve explored impactful best practices to gain control of your Cloudera Data Platform (CDP) environment and significantly improve its health and performance. We’ve shared code, dashboards, and tools to help you on your health improvement journey. We’d like to provide one last tool.

From Hive Tables to Iceberg Tables: Hassle-Free

For more than a decade now, the Hive table format has been a ubiquitous presence in the big data ecosystem, managing petabytes of data with remarkable efficiency and scale. But as data volumes, data variety, and data usage grow, users face many challenges when using Hive tables because of their antiquated directory-based table format. Some of the common issues include constrained schema evolution, static partitioning of data, and long planning time because of S3 directory listings.

12 Times Faster Query Planning With Iceberg Manifest Caching in Impala

Iceberg is an emerging open table format designed for large analytic workloads. The Apache Iceberg project continues to develop an implementation of the Iceberg specification in the form of a Java library. Several compute engines, such as Impala, Hive, Spark, and Trino, support querying data in the Iceberg table format by adopting this Java library provided by the Apache Iceberg project.