Data Science


Kylin

A distributed analytics engine that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.

Logistic regression

Logistic regression is the regression analysis to use when the dependent variable is dichotomous (binary). It describes data and explains the relationship between one binary dependent variable and one or more nominal, ordinal, interval, or ratio-level independent variables.
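As a rough illustration, the sketch below fits a one-variable logistic model with plain gradient descent. The data (hours studied versus pass/fail), the learning rate, and the function names are all hypothetical choices made for this example; a real analysis would normally use a statistics library.

```python
import math

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the average log loss with respect to w and b.
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical data: hours studied (independent) and pass/fail (binary dependent).
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
passed = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(hours, passed)
p_low = sigmoid(w * 1.0 + b)   # estimated pass probability after 1 hour
p_high = sigmoid(w * 4.0 + b)  # estimated pass probability after 4 hours
```

The fitted weight is positive here, so the estimated probability of the binary outcome rises with the independent variable.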


Luigi

Luigi is a Python package for building complex pipelines, developed at Spotify. In Luigi, as in Airflow, you specify workflows as tasks and the dependencies between them. The two building blocks of Luigi are Tasks and Targets.
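The idea of tasks, dependencies, and targets can be sketched in plain Python. This is not Luigi's API: the task names are hypothetical, the standard-library `graphlib` module stands in for the scheduler, and a simple set stands in for targets (outputs that already exist).

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}

completed = set()  # stands in for targets (outputs) that already exist

def run_pipeline(deps, done):
    """Run every task after its dependencies, skipping tasks whose output exists."""
    executed = []
    for task in TopologicalSorter(deps).static_order():
        if task in done:
            continue  # output already present, nothing to redo
        executed.append(task)
        done.add(task)
    return executed

first_run = run_pipeline(dependencies, completed)   # runs all four tasks in order
second_run = run_pipeline(dependencies, completed)  # runs nothing
```

Skipping work whose output already exists is what makes rerunning a partly finished pipeline cheap.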


Mathematics

Mathematics is the science of numbers and their operations, interrelations, combinations, generalizations, and abstractions and of space configurations and their structure, measurement, transformations, and generalizations.


MapReduce

The heart of Apache Hadoop. A software framework for easily writing applications that process vast amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
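The programming model behind such a framework can be sketched on a single machine. The example below is a word count split into map, shuffle, and reduce phases. It uses no Hadoop API, and the function names and input splits are hypothetical.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

# Hypothetical input splits; on a cluster each could live on a different node.
splits = ["big data big clusters", "data moves to clusters", "big results"]

mapped = [pair for split in splits for pair in map_phase(split)]
counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(mapped).items())
```

Because each map call sees only its own split and each reduce call sees only one key, both phases can be spread across many nodes.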

Microsoft Excel

Microsoft Excel is a spreadsheet developed by Microsoft for Windows, macOS, Android and iOS. It features calculation, graphing tools, pivot tables, and a macro programming language called Visual Basic for Applications. 


MLlib

A machine learning library of high-quality algorithms for Apache Spark. It supports the R, Python, Java, and Scala programming languages. It can run on Mesos, Hadoop, and Kubernetes, and can read data from a number of sources, such as Hive, Cassandra, HDFS, and HBase.


MXNet

MXNet is a deep learning library for GPU and cloud computing developers. It is an acceleration library that helps save time on building and deploying large-scale DNNs. It also offers predefined layers and tools for coding your own, for specifying data structure placement and automating calculations.

Neural Networks

Neural networks are a set of algorithms, modeled loosely after the human brain, that are designed to recognize patterns. They interpret sensory data through a kind of machine perception, labeling or clustering raw input.
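The smallest building block of such a network is a single artificial neuron. The sketch below trains one perceptron to label inputs according to the logical AND pattern. The learning rate, epoch count, and names are hypothetical choices, and real networks stack many such units with smoother activation functions.

```python
def step(z):
    """Fire (1) only when the weighted input is positive."""
    return 1 if z > 0 else 0

def train_perceptron(samples, lr=1.0, epochs=20):
    """Perceptron learning rule: nudge the weights toward each misclassified sample."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            error = target - step(w[0] * x1 + w[1] * x2 + b)
            w[0] += lr * error * x1
            w[1] += lr * error * x2
            b += lr * error
    return w, b

# The pattern to recognize: output 1 only when both inputs are 1.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

w, b = train_perceptron(and_samples)
predictions = [step(w[0] * x1 + w[1] * x2 + b) for (x1, x2), _ in and_samples]
```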


Optimization

Optimization is the process of modifying a system so that some of its features work more efficiently or use fewer resources. For instance, a computer program may be optimized to run faster, to use less memory or other resources, or to consume less energy.
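One common way to make a program run faster is to cache results it would otherwise recompute. The sketch below counts function calls for a naive and a memoized Fibonacci function. The example itself is hypothetical, and note the trade-off: memoization saves time by spending memory.

```python
from functools import lru_cache

calls = {"plain": 0, "cached": 0}

def fib_plain(n):
    """Naive recursion: recomputes the same subproblems many times."""
    calls["plain"] += 1
    return n if n < 2 else fib_plain(n - 1) + fib_plain(n - 2)

@lru_cache(maxsize=None)
def fib_cached(n):
    """Same recursion, but each distinct n is computed only once."""
    calls["cached"] += 1
    return n if n < 2 else fib_cached(n - 1) + fib_cached(n - 2)

plain_result = fib_plain(20)
cached_result = fib_cached(20)
```

Both functions return the same value, but the cached version's body runs 21 times where the naive one runs tens of thousands of times.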

Pattern Recognition

Pattern recognition is the process of recognizing patterns using machine learning algorithms. It can be defined as the classification of data based on knowledge already gained or on statistical information extracted from patterns and/or their representation.
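A minimal sketch of classification based on knowledge already gained is the nearest-neighbor rule: label a new point with the label of the closest known example. The measurements and labels below are hypothetical.

```python
import math

def nearest_neighbor(labeled_points, query):
    """Return the label of the known point closest to the query."""
    _, label = min(labeled_points, key=lambda item: math.dist(item[0], query))
    return label

# Hypothetical knowledge already gained: (width, height) measurements with labels.
known = [
    ((1.0, 1.1), "small"),
    ((1.2, 0.9), "small"),
    ((5.0, 5.2), "large"),
    ((4.8, 5.1), "large"),
]

label_a = nearest_neighbor(known, (1.1, 1.0))
label_b = nearest_neighbor(known, (5.1, 4.9))
```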

Predictive Modeling

The process of using data and statistical techniques to forecast future outcomes. 
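As one simple instance, the sketch below fits a least-squares line to past observations and extrapolates one step ahead. The monthly sales figures are hypothetical and exactly linear to keep the arithmetic easy to follow. Real forecasting would also need validation on held-out data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: sales for months 1 to 5.
months = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(months, sales)
forecast = slope * 6 + intercept  # predicted sales for month 6
```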


Oozie

A workflow scheduler system designed to manage Hadoop jobs. Oozie automates commonly performed tasks. With it, you can describe workflows to be performed on a Hadoop cluster, schedule those workflows to execute under a specified condition, and even combine multiple workflows and schedules into a package to manage their full lifecycle.


QlikView

An in-memory business discovery tool. It provides self-service BI for all business users in an organization and enables them to conduct direct and indirect searches across all data anywhere in the application.

Quantitative analytics

Quantitative analytics (QA) is a technique that seeks to understand behavior by using mathematical and statistical modeling, measurement, and research. Quantitative analysts aim to represent a given reality in terms of a numerical value.

Quantitative finance

Quantitative finance is the use of mathematical models and extremely large datasets to analyze financial markets and securities. Common examples include (1) the pricing of derivative securities such as options, and (2) risk management, especially as it relates to portfolio management applications.
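For the option-pricing example, one widely used model is the Black-Scholes formula for a European call option. The sketch below assumes that model (the text above does not name one), and the input values are hypothetical.

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black_scholes_call(spot, strike, rate, volatility, years):
    """Black-Scholes price of a European call on a non-dividend-paying asset."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * volatility ** 2) * years) / (
        volatility * math.sqrt(years)
    )
    d2 = d1 - volatility * math.sqrt(years)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * years) * norm_cdf(d2)

# Hypothetical inputs: at-the-money option, 5% rate, 20% volatility, one year.
price = black_scholes_call(spot=100.0, strike=100.0, rate=0.05, volatility=0.2, years=1.0)
```

With these inputs the model gives a price of about 10.45.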

Random Forest

A supervised learning algorithm that trains multiple decision trees on random subsets of the data and features, then combines their predictions into one “forest.”
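The sketch below is a simplified stand-in for a random forest: each "tree" is only a one-split decision stump, trained on a bootstrap sample and a randomly chosen feature, and the forest predicts by majority vote. The data, tree count, and seed are hypothetical, and a real random forest grows much deeper trees.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature):
    """Find the threshold on one feature that misclassifies the fewest rows."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        for above in (0, 1):  # label predicted when the value is >= threshold
            errors = sum(
                (above if row[feature] >= threshold else 1 - above) != label
                for row, label in zip(rows, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, above)
    return best[1:]

def predict_stump(stump, row):
    feature, threshold, above = stump
    return above if row[feature] >= threshold else 1 - above

def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and one randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(rows[0]))
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return forest

def predict_forest(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]

# Hypothetical two-feature data with two well-separated classes.
rows = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 9), (9, 8), (8, 8), (9, 9)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

forest = fit_forest(rows, labels)
pred_low = predict_forest(forest, (1, 1))
pred_high = predict_forest(forest, (9, 9))
```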


Samza

A distributed stream processing framework developed in conjunction with Apache Kafka. It lets you build stateful applications that process data in real time from multiple sources. It computes results continuously as data arrives, which makes sub-second response times possible.

SnapLogic eXtreme

A big data transformation tool that can process large volumes of information and doesn't require a special set of skills. The tool assists in cost and risk reduction and data analytics. It can be integrated with other big data tools and frameworks, including Amazon Elastic MapReduce and Azure HDInsight.


spaCy

A software library for advanced Natural Language Processing, written in Python and Cython. It helps build applications that process large volumes of text.

Spark MLlib

Spark MLlib is Apache Spark’s Machine Learning component. One of the major attractions of Spark is the ability to scale computation massively, and that is exactly what you need for machine learning algorithms. 

Spark Streaming

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.


Sqoop

A Java-based tool used for transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

Statistical modeling

Statistical modeling is a simplified, mathematically-formalized way to approximate reality (i.e. what generates your data) and optionally to make predictions from this approximation. The statistical model is the mathematical equation that is used.


Statistics

Statistics is a mathematical science pertaining to data collection, analysis, interpretation, and presentation. It can be used to derive meaningful insights from data by performing mathematical computations on it.
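A small sketch of deriving an insight from data, using Python's standard `statistics` module. The response times are hypothetical and include one outlier.

```python
import statistics

# Hypothetical response times in milliseconds, with one slow outlier.
response_times_ms = [120, 135, 128, 410, 131, 125, 129]

mean_ms = statistics.mean(response_times_ms)      # pulled upward by the outlier
median_ms = statistics.median(response_times_ms)  # robust to the outlier
spread_ms = statistics.stdev(response_times_ms)   # sample standard deviation
```

The mean sits well above the median, which suggests that a few slow requests, not a general slowdown, dominate the average.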
