Introducing Pipeline Designer: Reinventing Data Integration
Why Pipeline Designer?
It is no secret that data has become a competitive edge for companies in every industry. To maintain that edge, your organization needs to ensure three things:
- That you are gathering all the data that will bring the best insights
- That business units depending on the data receive it in a timely fashion to make quick decisions
- That there is an easy way to scale and innovate as new data requirements arise
Achieving this can be very difficult given the emergence of a multitude of new data types and technologies. For example, one of the biggest challenges that businesses face today is working with streaming paradigms of every kind while also handling new types of data arriving from social media, the web, sensors, the cloud, and so on. Companies see processing and delivering data in real time as a game changer that can bring real-time insight, but collecting and transforming this data easily has proven to be a challenge.
Take clickstream data, for example. Websites send this data constantly, and the stream never stops. The typical batch approach to ingesting or processing data, which relies on a definite “start” and “stop” of the data, does not fit streaming data and takes away the potential value of reacting to the data in real time. For example, online retailers rely on clickstream data to understand their users’ engagement with their websites, which is essential to understanding how to target users with the products that they will purchase. In an industry with razor-thin margins, it is essential to have real-time insight into customer activity and competitor pricing data in order to make the fast decisions needed to win market share.
Additionally, if you are relying on data from different applications, your company’s data integration tool may not cope well with data format changes, and data pipelines may break every time a new field is added to the source data. And even if IT is able to handle the dynamic nature of the data, the business units that need access to the data may have to wait several weeks before they can act on any insights, because of the increasing amount of work put on those who are responsible for distributing the data to the rest of the business.
In fact, in a recent data scientist survey, over 30% of data scientists reported the unavailability of data and the difficulty of accessing data as their top challenges. The market demand for increased access to actionable data is further supported by job postings, which show four times as many job openings for data engineers as for data scientists.
The data engineering skillset (accessing, collecting, transforming, and delivering all types of data to the business) is in demand, and data engineers today need to be more productive than ever before while working in a constantly changing data environment. At the same time, ad hoc integrators need to be able to access and integrate their data themselves, removing their reliance on IT.
And last, with more of the business demanding quicker turnaround times, both data engineers and ad hoc integrators need to integrate their data right away, and their data integration tools need to help them meet these new demands. Data engineers and ad hoc integrators now require a born-in-the-cloud integration tool that is not only accessible and intuitive but is also capable of working with the variety and volumes of data that they work with every day.
These problems may sound daunting, but don’t worry. We wouldn’t make you read this far without having an answer.
Introducing Pipeline Designer
As we saw this scenario play out over and over again with customers and prospects, we knew we could help. That’s why we built Pipeline Designer.
Pipeline Designer is a self-service web UI, built in the cloud, that makes data integration faster, easier, and more accessible in an age where everyone expects easy-to-use cloud apps and where data volumes, types, and technologies are growing at a seemingly impossible pace.
It enables data engineers to quickly and easily address lightweight integration use cases including transforming and delivering data into cloud data warehouses, ingesting and processing streaming data into a cloud data lake, and bulk loading data into Snowflake and Amazon Redshift. Because of the modern architecture of Pipeline Designer, users can work with both batch and streaming data without needing to worry about completely rebuilding their pipelines to accommodate growing data volumes or changing data formats, ultimately enabling them to transform and deliver data faster than before.
So what makes Pipeline Designer so unique? Here are a few highlights we want to share with you:
The live preview capabilities in Pipeline Designer allow you to do continuous data integration design. You no longer need to design, compile, deploy, and run the pipeline to see what the data looks like.
Instead, you can see your data changes in real time, at every step of your design process, in the exact same design canvas. Click on any processor in your pipeline and see the data before and after your transformation to make sure the output data is exactly what you’re looking for. This will dramatically reduce development time and speed up your digital transformation projects.
As a quick example, let’s take a look at the input and output of the Python transformation below:
Schema-on-read is a data integration strategy suited to modern use cases such as streaming data into big data platforms, messaging systems, and NoSQL stores. It saves time because incoming data, which is often less structured, does not have to be mapped to a fixed schema.
Pipeline Designer provides schema-on-read support, which removes the need to define schemas before building pipelines and keeps pipelines resilient when the schema changes. You do not declare a strict schema when defining a connection or dataset in Pipeline Designer. Instead, the structure of the data is inferred at the moment the pipeline is run: the pipeline gathers data and guesses its structure. If the source schema changes, the pipeline adapts to the change at its next run. This means you can start to work with your data immediately and add data sources “on-the-fly” because the schemas are dynamically discovered. In summary, it brings more resilience and flexibility compared to a “rigid” metadata definition.
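To make the idea concrete, here is a minimal sketch of schema inference at read time in plain Python. It is not how Pipeline Designer implements schema-on-read; the `infer_schema` function and the sample records are assumptions made up for illustration.

```python
import json

def infer_schema(records):
    """Infer a field -> type-names mapping from the records actually read."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    # Sort the type names so the result is stable and easy to compare.
    return {field: sorted(types) for field, types in schema.items()}

# First run: the (hypothetical) source emits two fields.
run_1 = [json.loads(line) for line in [
    '{"user": "a", "clicks": 3}',
    '{"user": "b", "clicks": 5}',
]]

# Later run: the source has gained a "country" field.
run_2 = [json.loads(line) for line in [
    '{"user": "a", "clicks": 3, "country": "FR"}',
]]

schema_1 = infer_schema(run_1)
schema_2 = infer_schema(run_2)
print(schema_1)
print(schema_2)
```

Because the structure is discovered from the data on each run, the new field in the second run is picked up without any change to the code that reads it.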
Integrate Any Data with Unparalleled Portability
Talend has long been a leader in “future proofing” your development work. You model your pipeline, and can then select the platform to run it on (on-prem, cloud or big data). And when your requirements change, you just select a different platform. One example is when we switched our code generator from MapReduce to Spark, so you could convert your job to run as optimized, native Spark in a few clicks. But now, it’s even better. By building on top of the open source project Apache Beam, we are able to decouple design and runtime, allowing you to build pipelines without having to think about the processing engine you will run your pipeline on.
Even more, you are able to design both streaming and batch pipelines in the same palette.
So you could plug the same pipeline into a bounded source, like a SQL query, or an unbounded source, such as a message queue, and it will work as a batch pipeline or a streaming pipeline simply based on the source of data. At runtime, you can choose to run natively in the cloud platform where your data resides, and you can even choose to run on EMR for ultimate scalability. Pipeline Designer truly achieves “design once and run anywhere” and allows you to run on multiple clouds in a scalable way.
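The bounded versus unbounded distinction can be illustrated with a small sketch. It uses plain Python generators rather than Apache Beam or Pipeline Designer itself, and the `enrich` function, the event fields, and both sources are hypothetical.

```python
import itertools

def enrich(events):
    """Pipeline logic: keep purchase events and add an order total."""
    for event in events:
        if event["type"] == "purchase":
            yield {**event, "total": event["price"] * event["qty"]}

# Bounded source: a finite result set, such as the rows of a SQL query.
bounded = [
    {"type": "purchase", "price": 2.0, "qty": 3},
    {"type": "view", "price": 0.0, "qty": 0},
]

def unbounded():
    """Unbounded source: stands in for a message queue that never ends."""
    for i in itertools.count(1):
        yield {"type": "purchase", "price": 1.0, "qty": i}

# The same logic runs to completion on the bounded source...
batch_result = list(enrich(bounded))

# ...and keeps producing results on the unbounded one; we only sample three.
stream_sample = list(itertools.islice(enrich(unbounded()), 3))

print(batch_result)
print(stream_sample)
```

The transformation is written once and never asks whether its input ends; only the source decides whether the run behaves like a batch job or a stream.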
Embedded Python Component
With Python being both the fastest growing programming language and a programming language commonly used by data engineers, we wanted Pipeline Designer to allow users to take advantage of their own Python skills and extend the tool to address any custom transformations they needed. So, Pipeline Designer embeds a Python component for scripting Python for customizable transformations.
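As an illustration of the kind of custom transformation a data engineer might script, here is a short sketch in plain Python. It does not show the actual interface of the Pipeline Designer Python component; the `transform` function, the record fields, and the URL are all assumptions.

```python
from urllib.parse import urlparse

def transform(record):
    """Hypothetical custom transformation: derive fields from a clickstream record."""
    parsed = urlparse(record["url"])
    return {
        "user_id": record["user_id"],
        "host": parsed.netloc,
        "path": parsed.path,
        # Flag product pages so downstream steps can filter on them.
        "is_product_page": parsed.path.startswith("/products/"),
    }

sample = {"user_id": 42, "url": "https://shop.example.com/products/123?ref=home"}
result = transform(sample)
print(result)
```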
Looking to put more data to work?
What’s even better with Pipeline Designer is that it’s not a standalone app or a single point solution. It is part of the Talend Data Fabric platform, which solves some of the most complex aspects of the data value chain from end to end. With Data Fabric, users can collect data across systems, govern it to ensure proper use, transform it into new formats, improve its quality, and share it with internal and external stakeholders.
Pipeline Designer is managed by the same application as the rest of Talend Cloud: the Talend Management Console. This continuity ensures that IT is able to have a full view of the Talend platform, providing the oversight and governance that can only come from a unified platform like Talend Cloud. And of course, IT gets all the other benefits of Talend Data Fabric including being in control of data usage, so it’s easy to audit and to ensure privacy, security and data quality.
Users new to Talend can start with Pipeline Designer knowing that there is a suite of purpose-built applications that are designed to work with each other in order to support a culture of comprehensive data management that spans throughout the business. As your needs grow, Talend will be able to support you through your data journey.
We are excited to offer a free, zero-download trial of the product so you can see how Pipeline Designer makes lightweight integration easier. You can find more details of the product features on the product page here or try it free for 14 days!