Why Apache Pig?


Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark.

Pig can take in any kind of data:

  1. Structured data
  2. Semi-structured data
  3. Unstructured data

Pig provides common data operations (filters, joins, ordering, and so on) and nested data types (tuples, bags, and maps) that are missing from MapReduce.

Pig Latin is easy to learn, easy to write, and easy to read:

  1. It is a data flow language
  2. A script reads like a series of steps
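To make the "series of steps" idea concrete, here is a minimal Python sketch of a three-step data flow. It is not Pig itself: the sample rows and variable names are made up for illustration, and each step carries a comment showing roughly what the matching Pig Latin statement would look like.

```python
# Hypothetical sample rows: (id, brand, country)
rows = [
    (1, "Apple", "USA"),
    (2, "Google", "USA"),
    (4, "JIO", "INDIA"),
]

# Step 1, roughly: usa = FILTER rows BY country == 'USA';
usa = [r for r in rows if r[2] == "USA"]

# Step 2, roughly: sorted_usa = ORDER usa BY brand DESC;
sorted_usa = sorted(usa, key=lambda r: r[1], reverse=True)

# Step 3, roughly: brands = FOREACH sorted_usa GENERATE brand;
brands = [r[1] for r in sorted_usa]

print(brands)
```

Each statement takes the result of the previous one and names a new result, which is what makes a data flow script easy to read from top to bottom.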

Pig offers an ad hoc way of creating and executing MapReduce jobs on very large data sets.

Pig is extensible through user-defined functions (UDFs), which can be written in Java, Python, JavaScript, or Ruby.
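As an illustration, a Python UDF might look like the sketch below. The function name `normalize_brand` and its schema string are hypothetical. The `outputSchema` decorator is imported from Pig's `pig_util` module when it is available, and replaced by a no-op fallback otherwise so the sketch can also be run outside Pig.

```python
try:
    # Provided by Pig for Python UDFs; not installed in a plain Python setup.
    from pig_util import outputSchema
except ImportError:
    # Fallback (an assumption of this sketch): a decorator that does nothing.
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap


@outputSchema("brand:chararray")
def normalize_brand(brand):
    """Trim whitespace and upper-case a brand name; pass nulls through."""
    if brand is None:
        return None
    return brand.strip().upper()


print(normalize_brand("  google "))
```

Assuming the file is saved as `brand_udfs.py` (a made-up name), a Pig script could register it with something like `REGISTER 'brand_udfs.py' USING streaming_python AS brand_udfs;` and then call `brand_udfs.normalize_brand(brand)` inside a `FOREACH ... GENERATE` statement.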

Pig is open source and actively supported by the community.

Can Pig replace MapReduce completely?

No, it cannot replace MapReduce completely, because several operations still have to be written directly in MapReduce code.

Pig Data Model

Tuple

A tuple is an ordered set of fields, and each field may have a different data type.

( 2 , Google , CA , USA , Good )

This is a tuple.

Bag

A bag is a collection of tuples. These tuples can be partial rows or entire rows of a table.

{(1,Apple),(2,Google,CA,USA),(4,JIO,MUM,INDIA,Good)}

This is a bag whose tuples have different lengths: some hold only part of a row, and one holds a full row.
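The tuple and bag examples above can be mirrored in Python as a rough analogy. This is only a sketch: a Python tuple stands in for a Pig tuple and a Python list stands in for a bag, and the variable names are made up.

```python
# A Pig tuple is an ordered set of fields; a Python tuple is a close analogue.
google = (2, "Google", "CA", "USA", "Good")

# A Pig bag is a collection of tuples, and the tuples need not have
# the same number of fields. A list stands in for the bag here.
bag = [
    (1, "Apple"),
    (2, "Google", "CA", "USA"),
    (4, "JIO", "MUM", "INDIA", "Good"),
]

# Fields are addressed by position, much like $0, $1, ... in Pig Latin.
brand_names = [t[1] for t in bag]
field_counts = [len(t) for t in bag]

print(google[1], brand_names, field_counts)
```

Positional access is what lets a script work with rows of uneven length, as long as it only reads fields that every tuple actually has.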

Map

A map is a set of key-value pairs, where each key is paired with the value it describes.

[ Brand#Google , City#CA , Country#USA ]

Here, Brand is a key and Google is its value. Each key and its value together form one entry of the map.
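A Python dictionary gives a rough feel for how a Pig map behaves. This is a sketch by analogy only, with made-up variable names; the `Rating` key is hypothetical and is used to show a lookup that finds nothing.

```python
# The map from the example, written as a Python dict.
record = {"Brand": "Google", "City": "CA", "Country": "USA"}

# Look up a value by key, similar to m#'Brand' in Pig Latin.
brand = record["Brand"]

# A key that is not present yields None here, comparable to a null in Pig.
missing = record.get("Rating")

# Rebuild the key#value entries used in the example above.
entries = [f"{k}#{v}" for k, v in record.items()]

print(brand, missing, entries)
```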

Atom

Atoms are the basic data types found in most languages, such as String (chararray in Pig), int, long, float, and double.
