
Data Pipelines

Do you want to build an ETL pipeline?

Analysts and data scientists use SQL queries to pull data from the data storage underbelly of an enterprise. They mold, reshape, and analyze the data so it can offer revenue-generating business insights to the company. But analytics is only as good as the material it works with: if the underlying data is missing, compromised, incomplete, or wrong, the analysis and the inferences drawn from it will be too.

The Ultimate Guide to Building a Data Pipeline

Data is the new oil. Almost every industry is becoming more data-driven, and this trend will only accelerate in the coming years. With so many organizations now relying on data for decision-making, they need to be able to access and analyze their information easily through data pipelines. This article will get you started on building your own data pipeline.

ETL Pipeline vs. Data Pipeline: What's the Difference?

ETL pipelines and data pipelines are two concepts growing increasingly important as businesses keep adding applications to their tech stacks. More and more data moves between systems, and this is where data and ETL pipelines play a crucial role. Take a comment on social media, for example: it might be picked up by your social listening tool and registered in a sentiment analysis app.
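To make the distinction concrete, here is a minimal Python sketch of that social-comment example. Everything in it is an illustrative assumption (the source and warehouse objects, the toy sentiment scorer, the table name) rather than an API from the article: a plain data pipeline would simply move the comments from one system to another, while the transform step in the middle is what makes it an ETL pipeline.

```python
# Illustrative sketch only: the source/warehouse objects and the naive
# sentiment scorer below are hypothetical stand-ins, not a real API.

def score_sentiment(text: str) -> int:
    """Toy sentiment score: +1 per positive word, -1 per negative word."""
    positives = {"great", "love", "good"}
    negatives = {"bad", "hate", "terrible"}
    words = text.split()
    return sum(w in positives for w in words) - sum(w in negatives for w in words)

def run_etl(source, warehouse):
    """Extract comments, transform them, and load them into the warehouse."""
    raw = source.fetch_new_comments()                          # Extract
    rows = [
        {
            "id": c["id"],
            "text": c["text"].strip().lower(),
            "sentiment": score_sentiment(c["text"].lower()),   # Transform
        }
        for c in raw
    ]
    warehouse.insert_rows("social_comments", rows)              # Load
```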

Optimizing your BigQuery incremental data ingestion pipelines

When you build a data warehouse, an important question is how to ingest data from the source system into the warehouse. If a table is small, you can simply reload it in full on a regular basis; if it is large, a common technique is to perform incremental table updates. This post demonstrates how you can improve the performance of incremental pipelines when you ingest data into BigQuery.
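As a minimal sketch of the incremental pattern (not the specific optimizations the post covers), new or changed rows staged in a delta table can be merged into the target table instead of reloading it. The project, dataset, table, and column names below are assumptions for illustration.

```python
# Minimal sketch of incremental ingestion into BigQuery with a MERGE statement.
# Project, dataset, table, and column names are illustrative assumptions.
from google.cloud import bigquery

client = bigquery.Client()

merge_sql = """
MERGE `my_project.analytics.orders` AS target
USING `my_project.staging.orders_delta` AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN
  UPDATE SET target.status = source.status,
             target.updated_at = source.updated_at
WHEN NOT MATCHED THEN
  INSERT (order_id, status, updated_at)
  VALUES (source.order_id, source.status, source.updated_at)
"""

# Apply only the staged delta rows; the target table is updated in place
# rather than being fully reloaded.
client.query(merge_sql).result()
```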

Migrating Data Pipelines from Enterprise Schedulers to Airflow

At Airflow Summit 2021, Unravel’s co-founder and CTO Shivnath Babu and senior software engineer Hari Nyer delivered a talk titled Lessons Learned while Migrating Data Pipelines from Enterprise Schedulers to Airflow. This story, along with the slides and videos included in it, comes from that presentation.

Automating Data Pipelines in CDP with CDE Managed Airflow Service

When we announced the GA of Cloudera Data Engineering back in September of last year, a key part of our vision was to simplify the automation of data transformation pipelines at scale. By leveraging Spark on Kubernetes as the foundation, along with a first-class job management API, many of our customers have been able to quickly deploy, monitor, and manage the life cycle of their Spark jobs with ease. In addition, we allowed users to automate their jobs on a time-based schedule.
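For a flavor of what a time-scheduled pipeline looks like in Airflow, here is a generic sketch using the standard BashOperator rather than anything CDE-specific; the DAG id, schedule, and spark-submit command are assumptions for illustration.

```python
# Generic Airflow sketch of a daily-scheduled Spark job; the DAG id, schedule,
# and spark-submit command are illustrative assumptions, not CDE specifics.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_transform_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",  # time-based schedule
    catchup=False,
) as dag:
    run_spark_job = BashOperator(
        task_id="run_spark_transform",
        bash_command="spark-submit --deploy-mode cluster /jobs/transform.py",
    )
```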

Why Modernizing the First Mile of the Data Pipeline Can Accelerate all Analytics

Every enterprise is trying to collect and analyze data to get better insight into its business. Whether they are consuming log files, sensor metrics, or other unstructured data, most enterprises manage and deliver data to a data lake and leverage various applications, such as ETL tools, search engines, and databases, for analysis. This whole architecture made a lot of sense when there was a consistent and predictable flow of data to process.