Big Data


A term that describes the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis. This data can be analyzed for insights that lead to better decisions and strategic business moves.

Big Data tools

Hadoop, Hive, Pig, Apache HBase, Cassandra, MapReduce (method), Spark.

Apache Flink

An open-source framework for distributed Big Data analytics, written in Java and Scala.

Apache Hive

A data warehouse system for Hadoop that summarizes, queries, and analyzes data. It provides an SQL-like query language called HiveQL.

Apache Pig

A platform/tool for creating programs that analyze large data sets by representing them as data flows. It is usually used with Hadoop.

Apache Spark

An open-source cluster computing technology designed for fast computation. It supports in-memory cluster computing, which increases an application's processing speed.


Apache Flume

A distributed service for collecting and moving large amounts of log data.


Apache Hadoop

An open-source software framework used for distributed storage and processing of big data sets across clusters of computers using simple programming models; an Apache project.


Apache HBase

An open-source, distributed, column-oriented database built on top of the Hadoop file system that provides quick random access to huge amounts of structured data.


HDFS

Stands for Hadoop Distributed File System. A distributed file system used to store large data sets on multiple nodes; it is deployed on low-cost commodity hardware.


HDP

Stands for Hortonworks Data Platform. A secure, enterprise-ready, open-source distribution of Apache Hadoop that is based on a centralized architecture (YARN).


Apache Impala

A parallel-processing SQL query engine for data stored in HDFS and Apache HBase that does not require any data transformation.


Apache Kylin

An open-source distributed analytics engine created by eBay to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop for large datasets.


MapReduce

An Apache Hadoop software framework for writing applications that process large amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner.
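The programming model behind this framework can be sketched in plain Python as a toy, single-process word count (illustrative only; a real job distributes the same map and reduce functions across a cluster):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle/sort: group pairs by key; Reduce: sum each group's counts.
    counts = {}
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        counts[key] = sum(v for _, v in group)
    return counts

docs = ["big data tools", "big clusters"]
word_counts = reduce_phase(map_phase(docs))
print(word_counts)  # {'big': 2, 'clusters': 1, 'data': 1, 'tools': 1}
```

Because the map function handles each record independently and the reduce function handles each key independently, both phases can be spread across many machines, which is what makes the model fault-tolerant and scalable.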


Apache Oozie

A workflow scheduler designed for managing Hadoop jobs. A workflow is a set of control-flow and action nodes in a directed graph.


Apache Sqoop

A Java-based tool used for transferring bulk data between Apache Hadoop and structured datastores such as relational databases.


Tableau

A data visualization tool used to create interactive visual analytics in the form of dashboards and to generate compelling business insights.


YARN

Stands for Yet Another Resource Negotiator. A large-scale, distributed operating system for Big Data applications; it is used for cluster management and is part of Apache Hadoop.
