Cloudera Search Training

Overview

Cloudera Educational Services' three-day Search training course is for developers and data engineers who want to index data in Hadoop for more powerful real-time queries. Participants will learn to get more value from their data by integrating Cloudera Search with external applications.

Learn a modern toolset

Cloudera Search brings full-text, interactive search and scalable, flexible indexing to Hadoop and an enterprise data hub. Powered by Apache Solr, Search delivers scale and reliability for a new generation of integrated, multi-workload queries.

Get hands-on experience

Through instructor-led discussion and interactive, hands-on exercises, participants will navigate the Hadoop ecosystem, learning how to:

  • Perform batch indexing of data stored in HDFS and HBase
  • Index streaming data in near real time with Flume
  • Index content in multiple languages and file formats
  • Process and transform incoming data with Morphlines
  • Create a user interface for your index using Hue
  • Integrate Cloudera Search with external applications
  • Improve the Search experience using features such as faceting, highlighting, and spelling correction (see the example below)
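
To give a flavor of the faceting and highlighting objective, here is a minimal sketch of a faceted, highlighted query run against a Search (Solr) collection over HTTP. The server address and the collection and field names (demo, body, language) are hypothetical; the request parameters (q, facet, facet.field, hl, hl.fl) are standard Solr query parameters.

    import requests

    # Hypothetical Solr endpoint and collection name.
    SOLR_URL = "http://localhost:8983/solr/demo/select"

    params = {
        "q": "body:hadoop",         # full-text query on a hypothetical 'body' field
        "wt": "json",               # request a JSON response
        "facet": "true",            # enable faceting...
        "facet.field": "language",  # ...on a hypothetical 'language' field
        "hl": "true",               # enable hit highlighting...
        "hl.fl": "body",            # ...for the 'body' field
    }

    results = requests.get(SOLR_URL, params=params).json()
    print("matches:", results["response"]["numFound"])
    print("language facets:", results["facet_counts"]["facet_fields"]["language"])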

What to expect

This course is intended for developers and data engineers with at least basic familiarity with Hadoop and experience programming in a general-purpose language such as Java, C, C++, Perl, or Python. Participants should be comfortable with the Linux command line and should be able to perform basic tasks such as creating and removing directories, viewing and changing file permissions, executing scripts, and examining file output. No prior experience with Apache Solr or Cloudera Search is required, nor is any experience with HBase or SQL.

Cloudera Data Science Workbench Training

Overview

Cloudera Data Science Workbench Training prepares learners to complete data science and machine learning projects using Cloudera Data Science Workbench (CDSW).

Get Hands-On Experience

Through narrated demonstrations and hands-on exercises, learners achieve proficiency in CDSW and develop the skills required to:

  • Navigate CDSW’s options and interfaces with confidence
  • Create projects in CDSW and collaborate securely with other users and teams
  • Develop and run reproducible Python and R code
  • Customize projects by installing packages and setting environment variables
  • Connect to a secure (Kerberized) Cloudera or Hortonworks cluster
  • Work with large-scale data using Apache Spark 2 with PySpark and sparklyr (see the sketch below)
  • Perform end-to-end machine learning workflows in CDSW using Python or R (read, inspect, transform, visualize, and model data)
  • Measure, track, and compare machine learning models using CDSW’s Experiments capability
  • Deploy models as REST API endpoints serving predictions using CDSW’s Models capability
  • Work collaboratively using CDSW together with Git
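
As a taste of the Spark objective above, here is a minimal sketch that starts a Spark 2 session from a Python session in CDSW and summarizes a dataset with the DataFrame API. It assumes Spark 2 is available to the workbench; the application name, HDFS path, and column name are hypothetical.

    from pyspark.sql import SparkSession

    # Start (or attach to) a Spark session; on a Kerberized cluster, CDSW
    # applies the Kerberos credentials configured for the user.
    spark = (SparkSession.builder
             .appName("cdsw-example")  # hypothetical application name
             .getOrCreate())

    # Read a CSV file from HDFS into a DataFrame (hypothetical path).
    flights = spark.read.csv("/user/example/flights.csv",
                             header=True, inferSchema=True)

    # Inspect and summarize the data ('carrier' is a hypothetical column).
    flights.printSchema()
    flights.groupBy("carrier").count().show()

    spark.stop()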

What to Expect

This OnDemand course is designed for learners at organizations using CDSW under a trial license or a commercial license. The learner must have access to a CDSW environment on a Cloudera or Hortonworks cluster running Apache Spark 2. Some experience with data science using Python or R is helpful but not required. No prior knowledge of Spark or other Hadoop ecosystem tools is required.

Cloudera Developer Training for Spark and Hadoop

Overview

This four-day hands-on training course delivers the key concepts and expertise developers need to use Apache Spark to develop high-performance parallel applications. Participants will learn how to use Spark SQL to query structured data and Spark Streaming to perform real-time processing on streaming data from a variety of sources. Developers will also practice writing applications that use core Spark to perform ETL processing and iterative algorithms. The course covers how to work with “big data” stored in a distributed file system and how to execute Spark applications on a Hadoop cluster. After taking this course, participants will be prepared to face real-world challenges and build applications that deliver faster and better decisions and interactive analysis, applied to a wide variety of use cases, architectures, and industries.

Prerequisites

This course is designed for developers and engineers who have programming experience, but prior knowledge of Spark and Hadoop is not required. Apache Spark examples and hands-on exercises are presented in Scala and Python. The ability to program in one of those languages is required. Basic familiarity with the Linux command line is assumed. Basic knowledge of SQL is helpful.

Course Objectives

  • How the Apache Hadoop ecosystem fits in with the data processing lifecycle
  • How data is distributed, stored, and processed in a Hadoop cluster
  • How to write, configure, and deploy Apache Spark applications on a Hadoop cluster
  • How to use the Spark shell and Spark applications to explore, process, and analyze distributed data
  • How to query data using Spark SQL, DataFrames, and Datasets (see the example below)
  • How to use Spark Streaming to process a live data stream
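
To illustrate the Spark SQL objective, here is a minimal, self-contained sketch that queries the same small in-memory dataset twice: once through the DataFrame API and once with a SQL statement against a temporary view. The data and names are invented for the example.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

    # Build a small DataFrame in place so the example is self-contained.
    people = spark.createDataFrame(
        [("alice", 34), ("bob", 45), ("carol", 29)],
        ["name", "age"])

    # The DataFrame API...
    people.where(people.age > 30).select("name").show()

    # ...and the equivalent Spark SQL query against a temporary view.
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    spark.stop()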

Security Training

Overview

This course covers how to secure a production Hadoop cluster, from assessing threats and hardening nodes to authentication, authorization, encryption, key management, and monitoring.

Get the Knowledge and Skills

After successfully completing this course, the student will be able to:

  • Describe security in the context of Hadoop
  • Assess threats to a production Hadoop cluster
  • Plan and deploy defenses against these threats
  • Improve the security of each node in the cluster
  • Set up authentication with Kerberos and Active Directory
  • Use permissions and ACLs to control access to files in HDFS (see the example below)
  • Use platform authorization features to control data access
  • Perform common key management tasks
  • Use encryption to protect data in motion and at rest
  • Monitor a cluster for suspicious activity
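
As a small taste of the HDFS ACL objective, here is a minimal sketch that wraps the standard hdfs dfs -setfacl and -getfacl shell commands from Python. It assumes the hdfs client is on the PATH and that the caller is already authenticated (for example, via kinit on a Kerberized cluster); the path and user are hypothetical.

    import subprocess

    # Hypothetical HDFS path to protect.
    PATH_IN_HDFS = "/data/reports"

    # Grant the (hypothetical) user 'alice' read-only access via an ACL entry.
    subprocess.run(
        ["hdfs", "dfs", "-setfacl", "-m", "user:alice:r--", PATH_IN_HDFS],
        check=True)

    # Display the resulting ACL to verify the change.
    subprocess.run(["hdfs", "dfs", "-getfacl", PATH_IN_HDFS], check=True)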

What To Expect

The course is intended for system administrators and those in similar roles. Prospective students should have a good understanding of Hadoop’s architecture, the ability to perform system administration tasks in the Linux environment, and at least basic exposure to Cloudera Manager. We recommend that students complete the Cloudera Administrator Training for Apache Hadoop course, or have equivalent on-the-job experience, before beginning this course. No prior training or experience with computer security is required.

Cloudera DataFlow: Flow Management with Apache NiFi

Overview

This three-day hands-on training course provides the fundamental concepts and experience necessary to automate the ingest, flow, transformation, and egress of data using Apache NiFi.

Along with gaining a grasp of the key features, concepts, and benefits of NiFi, participants will create and run NiFi dataflows for a variety of scenarios. Students will gain expertise using processors, connections, and process groups, and will use the NiFi Expression Language to control the flow of data from various sources to multiple destinations. Participants will monitor dataflows, examine the progress of data through a dataflow, and connect dataflows to external systems such as Kafka and HDFS. After taking this course, participants will have key knowledge and expertise for configuring and managing data ingestion, movement, and transformation scenarios for the enterprise.

What You Will Learn

Students who successfully complete this course will be able to:

  • Understand the role of Apache NiFi and MiNiFi in the Cloudera DataFlow platform
  • Describe NiFi’s architecture, including standalone and clustered configurations
  • Use key features, including FlowFiles, processors, process groups, controllers, and connections, to define a NiFi dataflow
  • Navigate the NiFi User Interface, configure dataflows, and work with dataflow information
  • Trace the life of data, its origin, transformation, and destination, using data provenance
  • Organize and simplify dataflows
  • Manage dataflow versions using the NiFi Registry
  • Use the NiFi Expression Language to control dataflows
  • Implement dataflow optimization methods and available monitoring and reporting features (see the example below)
  • Connect dataflows with other systems, such as Kafka and HDFS
  • Describe aspects of NiFi security
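
To give a flavor of the monitoring objective, here is a minimal sketch that polls NiFi's REST API for flow-level counters. It assumes an unsecured NiFi instance listening on http://localhost:8080 (a secured installation would require TLS and authentication) and uses the /nifi-api/flow/status endpoint, whose response includes counters such as activeThreadCount and flowFilesQueued.

    import requests

    # Assumes an unsecured NiFi instance on the default HTTP port.
    NIFI_API = "http://localhost:8080/nifi-api"

    # /flow/status reports flow-wide counters such as active threads
    # and queued FlowFiles.
    status = requests.get(NIFI_API + "/flow/status").json()["controllerStatus"]

    print("active threads:", status["activeThreadCount"])
    print("queued FlowFiles:", status["flowFilesQueued"])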

What to Expect

This course is designed for Developers, Data Engineers, Data Scientists, and Data Stewards. It teaches a no-code, graphical approach to configuring real-time data streaming, ingestion, and management solutions for a variety of use cases. Though programming experience is not required, basic experience with Linux is presumed. Exposure to big data concepts and applications is helpful.