Why Apache Pig?
Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark.
Pig can work with any kind of data:
- Structured data
- Semi-structured data
- Unstructured data
Pig provides data operations (filters, joins, ordering, etc.) and nested data types (tuples, bags, and maps) that are missing from MapReduce.
Pig Latin is easy to learn, easy to write, and easy to read:
- It is a data flow language
- It reads like a series of steps
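To illustrate the "series of steps" idea, here is a minimal sketch in plain Python, not Pig Latin itself. The sample records, the file name in the comments, and the helper name run_flow are all hypothetical. The comments show a roughly equivalent Pig Latin statement for each step.

```python
# Hypothetical sample records: (id, name, country)
RECORDS = [
    (1, "Apple", "USA"),
    (2, "Google", "USA"),
    (4, "JIO", "INDIA"),
]


def run_flow(records):
    """Run a three-step data flow: load, filter, order."""
    # Step 1, roughly:
    #   companies = LOAD 'companies.csv' USING PigStorage(',')
    #               AS (id:int, name:chararray, country:chararray);
    companies = list(records)

    # Step 2, roughly: usa = FILTER companies BY country == 'USA';
    usa = [row for row in companies if row[2] == "USA"]

    # Step 3, roughly: ordered = ORDER usa BY name DESC;
    ordered = sorted(usa, key=lambda row: row[1], reverse=True)
    return ordered


result = run_flow(RECORDS)
print(result)
```

Each step takes the output of the previous one and gives it a name, which is what makes a data flow script read from top to bottom.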
It offers an ad hoc way of creating and executing MapReduce jobs on very large data sets.
It is extensible through user-defined functions (UDFs), which can be written in:
- Java
- Python
- JavaScript
- Ruby
It is open source and actively supported by the community.
Can Pig replace MapReduce completely?
No, it cannot replace MapReduce completely, because some operations can only be implemented directly in MapReduce code.
Pig Data Model
Tuple
A tuple is an ordered set of fields, and each field may hold a different data type.
(2,Google,CA,USA,Good)
This is a tuple.
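As a rough analogy (an assumption for illustration, not a description of how Pig stores data internally), a Pig tuple behaves much like a Python tuple: fields keep their order and can have different types. The variable names below are hypothetical.

```python
# A Python tuple standing in for the Pig tuple shown above.
company = (2, "Google", "CA", "USA", "Good")

# Fields are accessed by position; Pig Latin uses $0, $1, ... for the same idea.
company_id = company[0]
name = company[1]

# Each field can have its own type.
field_types = [type(field).__name__ for field in company]
print(company_id, name, field_types)
```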
Bag
A bag is an unordered collection of tuples. Each tuple may be an entire row of a table or only a subset of its fields.
{(1,Apple),(2,Google,CA,USA),(4,JIO,MUM,INDIA,Good)}
This is a bag whose tuples hold either part of a row or a full row.
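A bag can be sketched in Python as a list of tuples. This is only an analogy: a Python list keeps its order, while a Pig bag is unordered, so the order below should not be relied on. The variable names are hypothetical.

```python
# A Python list of tuples standing in for the bag shown above.
bag = [
    (1, "Apple"),
    (2, "Google", "CA", "USA"),
    (4, "JIO", "MUM", "INDIA", "Good"),
]

# Tuples in a bag need not have the same number of fields.
field_counts = [len(t) for t in bag]

# When a tuple has no field at a given position, treat the value as
# missing (None here, null in Pig).
cities = [t[2] if len(t) > 2 else None for t in bag]
print(field_counts, cities)
```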
Map
A map is a set of key-value pairs, where each key is paired with the value it identifies.
[Brand#Google,City#CA,Country#USA]
Here, Brand is a key and Google is its value. Together, the key-value pairs form a map.
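A Python dict gives a reasonable mental model of a Pig map. The sketch below is illustrative only; the variable names and the "Zip" key are hypothetical.

```python
# A Python dict standing in for the map shown above.
company = {"Brand": "Google", "City": "CA", "Country": "USA"}

# Look up a value by its key; Pig Latin writes this as field#'Brand'.
brand = company["Brand"]

# A key that is not present yields no value (None here, null in Pig).
zip_code = company.get("Zip")
print(brand, zip_code)
```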
Atom
Atoms are the basic data types found in most languages, such as String, int, long, float, double, and char[]. In Pig, strings are represented by the chararray type.