Big Data Glossary

Apache Ambari

Ambari

It is a web interface for provisioning and managing Apache Hadoop clusters. Its development is led by Hortonworks engineers, who include Ambari in the Hortonworks Data Platform.

More information at http://ambari.apache.org

Data serialization system optimized for Hadoop/MapReduce

Avro

It is a data serialization system optimized for Hadoop/MapReduce. Compact and flexible, it supports several programming languages, which positions it as a very good alternative to SequenceFiles (from Hadoop) or Protocol Buffers (from Google).
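
As a sketch of how this serialization looks in practice, the snippet below uses the avro Python package (the exact function names can vary between versions of that library); the record schema, field names and file names are invented for illustration.

    import json

    import avro.schema
    from avro.datafile import DataFileReader, DataFileWriter
    from avro.io import DatumReader, DatumWriter

    # Hypothetical record schema, just for illustration
    schema = avro.schema.parse(json.dumps({
        "type": "record",
        "name": "PageView",
        "fields": [
            {"name": "user", "type": "string"},
            {"name": "clicks", "type": "int"},
        ],
    }))

    # Write a compact binary file; the schema is embedded alongside the data
    writer = DataFileWriter(open("pageviews.avro", "wb"), DatumWriter(), schema)
    writer.append({"user": "ada", "clicks": 42})
    writer.close()

    # Read it back; no external schema file is needed
    reader = DataFileReader(open("pageviews.avro", "rb"), DatumReader())
    for record in reader:
        print(record)
    reader.close()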

More information at https://avro.apache.org

More formal process or reference framework for Hadoop

Bigtop

It is an effort to create a more formal process or reference framework for the packaging and interoperability testing of Hadoop subprojects and their related components, with the purpose of improving the Hadoop platform as a whole.

More information at http://bigtop.apache.org

Distributed database

Cassandra

It is a distributed database originally developed by Facebook, designed to handle large amounts of data spread across commodity servers. It has a key/value architecture, no single point of failure (SPOF), a replication mechanism based on a gossip protocol, and an eventual consistency model.
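
A minimal sketch of that key/value style, using the cassandra-driver Python client against a hypothetical local node; the keyspace, table and column names are assumptions.

    from cassandra.cluster import Cluster

    # Connect to one or more contact points; the driver discovers the rest of the ring
    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect()

    # Replication settings are declared per keyspace
    session.execute(
        "CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
        "{'class': 'SimpleStrategy', 'replication_factor': 3}"
    )
    session.set_keyspace("demo")
    session.execute(
        "CREATE TABLE IF NOT EXISTS events (id text PRIMARY KEY, payload text)"
    )

    # Writes and reads are addressed by the partition key
    session.execute("INSERT INTO events (id, payload) VALUES (%s, %s)", ("e1", "hello"))
    row = session.execute("SELECT payload FROM events WHERE id = %s", ("e1",)).one()
    print(row.payload)

    cluster.shutdown()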

More information at http://cassandra.apache.org

Bulk loading of multiple text files into a Hadoop cluster

Chukwa

It is a subproject dedicated to bulk loading text files into a Hadoop cluster (ETL). Chukwa is built on top of the Hadoop Distributed File System (HDFS) and the MapReduce framework, and it inherits Hadoop's scalability and robustness. Chukwa also includes a set of flexible and powerful tools for visualizing and analyzing the results.

More information at https://chukwa.apache.org

Interactive queries for the analysis of nested data

Dremel

It is an interactive query system for the analysis of read-only nested data. It is a scalable ad-hoc solution that, by combining multi-level execution trees and a columnar data layout, is able to run aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google.

More information at http://research.google.com/pubs/pub36632.html

Framework for feeding data into Hadoop

Flume

It is a framework for feeding data into Hadoop. Its agents are deployed throughout the IT infrastructure (inside web servers, application servers and mobile devices) to collect that data and integrate it into Hadoop.

More information at https://flume.apache.org

Apache Hama

Hama

It is a distributed computing platform based on massively parallel computing techniques, used for example for scientific computations and matrix, graph and network algorithms.

More information at http://hama.apache.org

Low-latency NoSQL database

HBase

It is a low-latency NoSQL database. It is Hadoop's open-source Java version of Google's famous NoSQL database, BigTable. Its main characteristics are column-oriented storage, data versioning, consistent reads and writes, and automatic recovery in case of failure. It was chosen by Facebook, among other uses, to store all the user e-mail of its platform.
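
A minimal sketch of the column-oriented model, assuming the happybase Python client and a Thrift gateway running next to HBase; the table name, column family and row key are invented.

    import happybase

    # happybase talks to HBase through its Thrift gateway
    connection = happybase.Connection("localhost")

    # Assumes a "users" table with an "info" column family already exists
    table = connection.table("users")

    # Cells are addressed by row key and "family:qualifier" column names
    table.put(b"user-001", {b"info:email": b"ada@example.com",
                            b"info:name": b"Ada"})

    row = table.row(b"user-001")
    print(row[b"info:email"])

    connection.close()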

More information at https://hbase.apache.org

Hadoop's storage layer

HDFS

Hadoop Distributed File System, the storage layer of Hadoop, is a scalable, fault-tolerant distributed file system written in Java. Although Hadoop can work with multiple file systems (the local Linux file system, GlusterFS, Amazon S3...), HDFS stands out because it is fully compatible with MapReduce and offers data-locality optimization, which makes it the "natural" choice for Hadoop.
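
As an illustration of how an application reads and writes files in HDFS, the sketch below uses the hdfs Python package (a WebHDFS client); the NameNode address, port, user and paths are assumptions.

    from hdfs import InsecureClient

    # WebHDFS endpoint of the NameNode (address and port are assumptions)
    client = InsecureClient("http://namenode:9870", user="hadoop")

    # Write a file; HDFS splits it into blocks and replicates them across DataNodes
    with client.write("/tmp/example.txt", encoding="utf-8", overwrite=True) as writer:
        writer.write("hello from HDFS\n")

    # List a directory and read the file back
    print(client.list("/tmp"))
    with client.read("/tmp/example.txt", encoding="utf-8") as reader:
        print(reader.read())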

More information at HDFS Architecture Guide.

Apache Kafka

Kafka

Developed by LinkedIn, it is a distributed publish-subscribe messaging system that offers a solution capable of handling and processing all the activity-stream data of a high-traffic website. These kinds of data (page views, searches and other user actions) are a key ingredient of the current social web.
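
A minimal sketch of that publish-subscribe flow with the kafka-python client; the broker address and topic name are assumptions.

    from kafka import KafkaConsumer, KafkaProducer

    # Producer side: publish activity events to a topic
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("page-views", b'{"user": "ada", "url": "/home"}')
    producer.flush()

    # Consumer side: any number of consumers can read the same stream
    consumer = KafkaConsumer(
        "page-views",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        consumer_timeout_ms=5000,
    )
    for message in consumer:
        print(message.value)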

More information at http://kafka.apache.org

Open-source document-oriented NoSQL database system

MongoDB

It is an open-source, document-oriented NoSQL database system. Being document-oriented, it saves data structures as documents with a dynamic schema following JSON notation (Mongo calls these dynamic structures BSON). This means there is no predefined schema: a document need not contain all the fields defined for other documents, which makes integrating data in certain applications easier and faster.
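
A minimal sketch of that dynamic schema with the pymongo client; the connection string, database name, collection name and document fields are assumptions.

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    users = client["demo"]["users"]

    # Documents in the same collection do not need to share the same fields
    users.insert_one({"name": "ada", "languages": ["python", "r"]})
    users.insert_one({"name": "grace", "rank": "admiral", "active": True})

    for doc in users.find({"name": "ada"}):
        print(doc)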

More information at https://www.mongodb.org

Open-source graph database

Neo4j

It is an open-source graph database supported by Neo Technology. Neo4j stores data as nodes connected by directed, typed relationships, with properties on both, a model also known as a property graph.
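
A minimal sketch of the property-graph model through the official Python driver; the Bolt URI, credentials and Cypher statements are assumptions.

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        # Nodes carry labels and properties; relationships are directed and typed
        session.run(
            "CREATE (a:Person {name: $a})-[:KNOWS]->(b:Person {name: $b})",
            a="Ada", b="Grace",
        )
        result = session.run("MATCH (a:Person)-[:KNOWS]->(b:Person) RETURN a.name, b.name")
        for record in result:
            print(record["a.name"], "knows", record["b.name"])

    driver.close()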

More information at http://neo4j.com

Workflow management system

Oozie

It is a workflow management system that allows users to define a series of jobs written in several languages, such as MapReduce, Pig and Hive, and link them logically to one another as a flow of processes (jobs). Oozie allows users to specify, for example, that a particular query should start only after the previous jobs on which it depends for data have completed.

More information at http://oozie.apache.org

Developed by Yahoo to simplify MapReduce programming

Pig

It is a high-level programming language, developed by Yahoo, that makes MapReduce programming on Hadoop easier. It is relatively easy to learn (because it is very expressive and readable) and it is efficient with large data flows.

More information at https://pig.apache.org

Environment for statistical computing and graphics

R

It is a language and environment for statistical computing and graphics. It is a GNU project, similar to the S language. R offers a wide variety of statistical techniques (linear and nonlinear models, classical statistical tests, time-series analysis, classification, clustering, ...) as well as graphical techniques. It is also highly extensible, and there is a popular IDE for R called RStudio.

More information at https://www.r-project.org

NoSQL database inspired by Dynamo

Riak

It is an open-source, distributed NoSQL database inspired by Dynamo, with a commercial version. It is a key-value database with some metadata, no storage schema, and it is data-type and language agnostic: through a REST API and a PBC API it supports many languages (Erlang, JavaScript, Java, PHP, Python, Ruby...). It is masterless (all nodes are equal), scalable, eventually consistent, and it uses map/reduce and "links". Riak can solve a new kind of data management problem, specifically those related to capturing, storing and processing data in distributed, modern IT environments such as the cloud.
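
A minimal sketch of the key-value access pattern over Riak's HTTP/REST interface, using the requests library; the node address, bucket and key are assumptions.

    import requests

    # Values are opaque to Riak and addressed by bucket and key (node address is an assumption)
    base = "http://localhost:8098/buckets/users/keys"

    # Store a value under a key
    requests.put(
        base + "/ada",
        data=b'{"lang": "python"}',
        headers={"Content-Type": "application/json"},
    )

    # Fetch it back
    response = requests.get(base + "/ada")
    print(response.status_code, response.text)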

Apache Sqoop

Sqoop

It is a connectivity tool for moving data between Hadoop and non-Hadoop data stores, such as relational databases and data warehouses. It allows users to specify the target location within Hadoop and instruct Sqoop to move data from Oracle, Teradata or other relational databases to that target.

More information at http://sqoop.apache.org

Storm

It is a free and open-source distributed real-time computation system. Storm makes it easy to reliably process unstructured data streams, doing for real-time processing what Hadoop did for batch processing.

More information at http://storm.apache.org

Project Voldemort

Voldemort

It is a distributed key-value storage system. It is used at LinkedIn for certain high-scalability storage problems where a simple functional partitioning is not sufficient.

More information at http://www.project-voldemort.com

Software from the Apache Software Foundation

ZooKeeper

It is an Apache Software Foundation software project that provides an open-source, centralized configuration and naming service for large distributed systems. ZooKeeper is a Hadoop subproject.
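
A minimal sketch of such a centralized configuration service, using the kazoo Python client; the ensemble address and znode paths are assumptions.

    from kazoo.client import KazooClient

    # Connect to the ZooKeeper ensemble (address is an assumption)
    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    # Configuration lives in a hierarchical namespace of "znodes"
    zk.ensure_path("/app/config")
    zk.create("/app/config/db_url", b"jdbc:mysql://db:3306/demo")

    data, stat = zk.get("/app/config/db_url")
    print(data.decode(), "version:", stat.version)

    zk.stop()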

More information at https://zookeeper.apache.org