Talend Big Data

Introduction

Built on top of Talend’s data integration solution, the big data solution is a powerful tool that enables users to access, transform, move and synchronize big data by leveraging the Apache Hadoop Big Data Platform and makes the Hadoop platform ever so easy to use.

Architecture

Big Data architecture is similar to Data Integration architecture.

Logical Reference Architecture

The Talend Big Data functional architecture is an architectural model that identifies Talend Big Data functions, interactions and corresponding IT needs. The overall architecture has been described by isolating specific functionalities in functional blocks. For more details on the individual components of Logical Architecture please refer Talend Help documentation

The following chart illustrates the main architectural functional blocks.

alt text

Highlights (Same as for Data Integration & in-addition below):

  • JobServers should be setup for plain DI ETL Jobs and for Big Data Jobs.
  • There should be a separation of concern.
  • While the Logical Reference Architecture diagram shows JobServer outside the cluster, we recommend, as a best practice, to install the - JobServer on the Edge nodes of a cluster.

Physical Reference Architecture

Edge Node or Gateway Node

  • Has all the necessary libraries and client components present
  • Has the current configuration of the cluster
  • Do not mix with actual cluster service nodes
  • Trusted by the cluster to have end users running CLI based tools like beeline, hdfs, hive, tools, etc.
  • You will require as many Edge Nodes as required to meet your solution requirements.

Recommendations

  • Job Server for Big Data Jobs should be installed on Cluster Edge Nodes
  • Job Server on Edge Nodes should be running Linux OS
  • Linux OS will allow impersonation
  • Use Job Server outside the cluster for pure DI batch jobs for separation of concerns

Environments

Workstation/Server Role OS CPU RAM SSD Disk Size
Client PC Windows/Linux/Mac 4 Cores i7 Processor or equivalent 16 GB 500 GB
Talend Administration Center Windows/Linux 4 Cores 8 GB RAM Minimum, 32 GB Recommended for 1000s of Jobs 300GB+ Minimum (for software & logs)
Job Server(s) Windows/Linux 4 Cores Minimum,8+ Cores Recommended 16 GB RAM Minimum,128 GB Recommended 300+ GB
Edge Node(s) Linux 4 Cores Minimum,32+ Cores Recommended 16 GB RAM Minimum,256 GB Recommended 1+ TB
Centralized Log Server Windows/Linux 4 Cores Minimum 16 GB RAM 300+ GB
Data Prep & Data Stewardship Server Windows/Linux 4 Cores Minimum 32 GB RAM 300+ GB
Shared Nexus Server Windows/Linux 4 Cores Minimum 8 GB RAM Minimum 300+ GB
Git Server (Better in Saas Mode) Windows/Linux 4 Cores Minimum 8 GB RAM Minimum 50+ GB
CI Server Windows/Linux 4 Cores Minimum,8 Cores Recommended 16 GB RAM 300+ GB