Data Science 

A distributed analytics engine that provides a SQL interface and multidimensional analysis (OLAP) on Hadoop, supporting extremely large datasets. 

Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary). Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables. 
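As a sketch, a minimal logistic regression can be fitted by gradient descent using only the standard library; the one-feature data and learning rate below are illustrative, not from any real dataset:

```python
# Minimal logistic regression via gradient descent (illustrative toy example).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(xs, ys, lr=0.5, epochs=2000):
    # Model: P(y = 1 | x) = sigmoid(w * x + b)
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            # Gradient of the log-loss for one sample
            w -= lr * (p - y) * x / n
            b -= lr * (p - y) / n
    return w, b

# Toy binary data: the outcome flips from 0 to 1 around x = 3
xs = [0, 1, 2, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit(xs, ys)

def predict(x):
    return sigmoid(w * x + b) >= 0.5
```

The fitted curve gives a probability between 0 and 1; thresholding it at 0.5 turns the regression into a binary classifier.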

Luigi is a Python package for building complex pipelines; it was developed at Spotify. In Luigi, as in Airflow, you specify workflows as tasks and dependencies between them. The two building blocks of Luigi are Tasks and Targets. 

Mathematics is the science of numbers and their operations, interrelations, combinations, generalizations, and abstractions and of space configurations and their structure, measurement, transformations, and generalizations. 

The heart of Apache Hadoop. A software framework for easily writing applications which process vast amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. 
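The programming model can be illustrated in a few lines of plain Python; this is a single-process toy word count following the map, shuffle, reduce phases, not Hadoop itself:

```python
# Toy illustration of the MapReduce model: map emits (key, value) pairs,
# the shuffle groups values by key, and reduce aggregates each group.
from collections import defaultdict

def map_phase(document):
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    return (key, sum(values))

def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["the quick fox", "the lazy dog"])
# counts["the"] == 2
```

In real MapReduce, the map and reduce calls run on different machines and the framework handles the shuffle, scheduling, and fault tolerance.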

Microsoft Excel is a spreadsheet developed by Microsoft for Windows, macOS, Android and iOS. It features calculation, graphing tools, pivot tables, and a macro programming language called Visual Basic for Applications. 

A machine learning library of high-quality algorithms for Apache Spark. It supports the R, Python, Java and Scala programming languages. It can run on Mesos, Hadoop and Kubernetes, and can extract data from a number of data sources, such as Hive, Cassandra, HDFS, and HBase. 

MXNet is a deep learning library for GPU and cloud computing developers. It is an acceleration library that helps save time when building and deploying large-scale DNNs. It also offers predefined layers and tools for coding your own, for specifying data-structure placement, and for automating calculations. 

Neural networks are a set of algorithms, modeled loosely on the human brain, that are designed to recognize patterns. They interpret sensory data through a kind of machine perception, labeling or clustering raw input. 
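The layered structure can be shown with a tiny fixed-weight network computing XOR; the weights are chosen by hand (not learned) to show how units in a hidden layer combine simple patterns, here OR and AND, into a decision no single neuron could make:

```python
# Tiny feed-forward network with hand-set weights computing XOR.
def step(z):
    # Step activation: fire (1) when the weighted sum exceeds the threshold.
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    h_or  = step(1.0 * x1 + 1.0 * x2 - 0.5)   # fires if either input is on
    h_and = step(1.0 * x1 + 1.0 * x2 - 1.5)   # fires only if both are on
    return step(1.0 * h_or - 1.0 * h_and - 0.5)  # "OR but not AND" = XOR
```

In practice the weights are learned from data (e.g. by backpropagation) rather than set by hand.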

Optimization is the process of modifying a system to make some of its features work more efficiently or use fewer resources. For instance, a computer program may be optimized so that it runs faster, requires less memory or other resources, or consumes less energy. 
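A small concrete instance of program optimization: caching repeated sub-computations turns the naive recursive Fibonacci from exponential to linear time, trading a little memory for a lot of speed:

```python
# Optimization in miniature: memoization removes redundant recursive calls.
from functools import lru_cache

def fib_slow(n):
    # Naive recursion recomputes the same subproblems exponentially often.
    return n if n < 2 else fib_slow(n - 1) + fib_slow(n - 2)

@lru_cache(maxsize=None)
def fib_fast(n):
    # Each value is computed once and cached, so this runs in linear time.
    return n if n < 2 else fib_fast(n - 1) + fib_fast(n - 2)
```

Both functions return the same answers; only the resource usage differs, which is the essence of optimization.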

Pattern recognition is the process of recognizing patterns using machine learning algorithms. It can be defined as the classification of data based on knowledge already gained or on statistical information extracted from patterns and/or their representation. 
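One of the simplest pattern-recognition methods, classification from "knowledge already gained", is the nearest-neighbour rule: label a new point like the closest previously seen example. The training points and labels below are illustrative:

```python
# Minimal 1-nearest-neighbour classifier on 2D points.
import math

def nearest_neighbor(train, point):
    # train: list of ((x, y), label) pairs; return the label of the closest one.
    return min(train, key=lambda item: math.dist(item[0], point))[1]

train = [((0, 0), "low"), ((0, 1), "low"), ((5, 5), "high"), ((6, 5), "high")]
# nearest_neighbor(train, (1, 1)) -> "low"
```

Despite its simplicity, this already captures the core idea: classify new data by its similarity to known patterns.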

The process of using data and statistical techniques to forecast future outcomes. 
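A minimal sketch of such a forecast: fit a trend line by ordinary least squares and extrapolate it one step ahead. The monthly sales figures are made up for illustration:

```python
# Predictive analytics in miniature: least-squares trend line, extrapolated.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

months = [1, 2, 3, 4]
sales  = [10, 12, 14, 16]
slope, intercept = fit_line(months, sales)
forecast = slope * 5 + intercept   # predicted sales for month 5 -> 18.0
```

Real predictive models use the same idea, fit historical data, then extrapolate, with far richer model families and validation.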

A workflow scheduler system designed to manage Hadoop jobs. Oozie automates commonly performed tasks. Using it, you can describe workflows to be performed on a Hadoop cluster, schedule those workflows to execute under specified conditions, and even combine multiple workflows and schedules into a package to manage their full lifecycle. 
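Oozie workflows are described as XML; a rough sketch of a one-action workflow is shown below. The workflow, action, and script names are illustrative, and the schema version numbers may differ between Oozie releases:

```xml
<workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="run-script"/>
    <action name="run-script">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>prepare_data.sh</exec>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Workflow failed at run-script</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Each action names the node to go to on success (`ok`) and on failure (`error`), which is how Oozie chains tasks into a directed workflow.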

An in-memory business discovery tool. Provides self-service BI for all business users in organizations. Enables users to conduct direct and indirect searches across all data anywhere in the application. 

Quantitative analytics (QA) is a technique that seeks to understand behavior by using mathematical and statistical modeling, measurement, and research. Quantitative analysts aim to represent a given reality in terms of a numerical value. 

Quantitative finance is the use of mathematical models and extremely large datasets to analyze financial markets and securities. Common examples include (1) the pricing of derivative securities such as options, and (2) risk management, especially as it relates to portfolio management applications. 
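A classic example of (1), derivative pricing, is the Black-Scholes formula for a European call option, which needs only the standard library; the parameter values below are illustrative:

```python
# Black-Scholes price of a European call option (illustrative parameters).
import math

def norm_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call(S, K, T, r, sigma):
    # S: spot price, K: strike, T: years to expiry,
    # r: risk-free rate, sigma: annualized volatility
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return S * norm_cdf(d1) - K * math.exp(-r * T) * norm_cdf(d2)

price = bs_call(S=100, K=100, T=1.0, r=0.05, sigma=0.2)   # about 10.45
```

The same mathematical-model-plus-data approach underlies (2) as well, e.g. portfolio risk measures computed from return distributions.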

A supervised learning algorithm that builds many randomized decision trees and combines their predictions into a single “forest” by voting or averaging. 
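A toy sketch of the idea, assuming the simplest possible trees (one-split "stumps") on one-dimensional data: each tree is trained on a bootstrap resample, and the forest predicts by majority vote. The data and tree count are illustrative:

```python
# Toy random forest: bootstrapped decision stumps combined by majority vote.
import random
from collections import Counter

def train_stump(points):
    # points: list of (x, label); pick the threshold that best splits labels.
    best = None
    for threshold, _ in points:
        preds = [(1 if x > threshold else 0) for x, _ in points]
        acc = sum(p == y for p, (_, y) in zip(preds, points))
        if best is None or acc > best[0]:
            best = (acc, threshold)
    return best[1]

def train_forest(points, n_trees=25, seed=0):
    rng = random.Random(seed)
    # Each tree sees a bootstrap resample (sampling with replacement).
    return [train_stump([rng.choice(points) for _ in points])
            for _ in range(n_trees)]

def predict(forest, x):
    votes = Counter(1 if x > t else 0 for t in forest)
    return votes.most_common(1)[0][0]

data = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 1), (5, 1)]
forest = train_forest(data)
```

Real random forests additionally randomize the features considered at each split and grow full trees, but the bootstrap-plus-vote structure is the same.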

A distributed stream processing framework developed in conjunction with Apache Kafka. It allows you to build stateful applications that process data in real time from multiple sources, and it continuously computes results as data arrives, making sub-second response times possible. 

A big data transformation tool that can process large volumes of information and doesn't require a special set of skills. The tool assists in cost and risk reduction and in data analytics. It can be integrated with other big data tools and frameworks, including Amazon Elastic MapReduce and Azure HDInsight. 

A software library for advanced natural language processing, written in Python and Cython. It helps build applications that process large volumes of text. 

Spark MLlib is Apache Spark’s Machine Learning component. One of the major attractions of Spark is the ability to scale computation massively, and that is exactly what you need for machine learning algorithms. 

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. 

A Java-based tool used for transferring bulk data between Apache Hadoop and structured datastores such as relational databases. 

Statistical modeling is a simplified, mathematically formalized way to approximate reality (i.e., what generates your data) and, optionally, to make predictions from this approximation. The statistical model is the mathematical equation that is used. 

Statistics is the mathematical science of data collection, analysis, interpretation and presentation. Statistics can be used to derive meaningful insights from data by performing mathematical computations on it. 
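Python's standard library covers the basic descriptive computations the definition mentions; the sample values below are illustrative:

```python
# Basic descriptive statistics with the stdlib `statistics` module.
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
mean   = statistics.mean(data)     # 5.0
median = statistics.median(data)   # 4.5
stdev  = statistics.pstdev(data)   # population standard deviation: 2.0
```

Summaries like these (central tendency and spread) are the usual first step before any modeling or inference.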