Awesome Big Data

Tags: hadoop, big data



A curated list of awesome big data frameworks, resources and other awesomeness. Inspired by awesome-php, awesome-python, awesome-ruby, hadoopecosystemtable & big-data.

Your contributions are always welcome!


  • Apache Hadoop - framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system).
  • Tigon - High Throughput Real-time Stream Processing Framework.

Distributed Programming

  • AddThis Hydra - distributed data processing and storage system originally developed at AddThis.
  • AMPLab SIMR - run Spark on Hadoop MapReduce v1.
  • Apache Crunch - a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce.
  • Apache DataFu - collection of user-defined functions for Hadoop and Pig developed by LinkedIn.
  • Apache Flink - high-performance runtime, and automatic program optimization.
  • Apache Gora - framework for in-memory data model and persistence.
  • Apache Hama - BSP (Bulk Synchronous Parallel) computing framework.
  • Apache MapReduce - programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
  • Apache Pig - high level language to express data analysis programs for Hadoop.
  • Apache S4 - framework for stream processing, implementation of S4.
  • Apache Spark - framework for in-memory cluster computing.
  • Apache Spark Streaming - framework for stream processing, part of Spark.
  • Apache Storm - framework for stream processing, open-sourced by Twitter; also runs on YARN.
  • Apache Samza - stream processing framework, based on Kafka and YARN.
  • Apache Tez - application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN.
  • Apache Twill - abstraction over YARN that reduces the complexity of developing distributed applications.
  • Cascalog - data processing and querying library.
  • Cheetah - High Performance, Custom Data Warehouse on Top of MapReduce.
  • Concurrent Cascading - framework for data management/analytics on Hadoop.
  • Damballa Parkour - MapReduce library for Clojure.
  • Datasalt Pangool - alternative MapReduce paradigm.
  • DataTorrent StrAM - real-time engine designed to enable distributed, asynchronous, in-memory big-data computations with as little blocking as possible and minimal overhead and performance impact.
  • Facebook Corona - Hadoop enhancement which removes single point of failure.
  • Facebook Peregrine - Map Reduce framework.
  • Facebook Scuba - distributed in-memory datastore.
  • Google Dataflow - create data pipelines to ingest, transform and analyze data.
  • Google MapReduce - map reduce framework.
  • Google MillWheel - fault tolerant stream processing framework.
  • JAQL - declarative programming language for working with structured, semi-structured and unstructured data.
  • Kite - is a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.
  • Metamarkets Druid - framework for real-time analysis of large datasets.
  • Netflix PigPen - map-reduce for Clojure which compiles to Apache Pig.
  • Nokia Disco - MapReduce framework developed by Nokia.
  • Pinterest Pinlater - asynchronous job execution system.
  • Pydoop - Python MapReduce and HDFS API for Hadoop.
  • Rackerlabs Blueflood - multi-tenant distributed metric processing system
  • Stratosphere - general purpose cluster computing framework.
  • Streamdrill - useful for counting activities in event streams over different time windows and finding the most active ones.
  • Tuktu - Easy-to-use platform for batch and streaming computation, built using Scala, Akka and Play!
  • Twitter Scalding - Scala library for Map Reduce jobs, built on Cascading.
  • Twitter Summingbird - Streaming MapReduce with Scalding and Storm, by Twitter.
  • Twitter TSAR - TimeSeries AggregatoR by Twitter.
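
Many of the entries above build on or compile down to the MapReduce programming model. As a rough illustration only, the sketch below runs the three classic phases (map, shuffle, reduce) in a single Python process to count words. It does not use the Hadoop API; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical and chosen for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big clusters"]))
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; the frameworks listed here differ mainly in how they schedule and express those steps.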

Distributed Filesystem

Document Data Model

  • Actian Versant - commercial object-oriented database management system.
  • Crate Data - is an open source massively scalable data store. It requires zero administration.
  • Facebook Apollo - Facebook’s Paxos-like NoSQL database.
  • jumboDB - document oriented datastore over Hadoop.
  • LinkedIn Espresso - horizontally scalable document-oriented NoSQL data store.
  • MarkLogic - Schema-agnostic Enterprise NoSQL database technology.
  • MongoDB - Document-oriented database system.
  • RavenDB - A transactional, open-source Document Database.
  • RethinkDB - document database that supports queries like table joins and group by.

Key Map Data Model

Note: There is some term confusion in the industry, and two different things are called "Columnar Databases". Some, listed here, are distributed, persistent databases built around the "key-map" data model: all data has a (possibly composite) key, with which a map of key-value pairs is associated. In some systems, multiple such value maps can be associated with a key, and these maps are referred to as "column families" (with value map keys being referred to as "columns").

Another group of technologies that can also be called "columnar databases" is distinguished by how it stores data, on disk or in memory -- rather than storing data the traditional way, where all column values for a given key are stored next to each other, "row by row", these systems store all column values next to each other. So more work is needed to get all columns for a given key, but less work is needed to get all values for a given column.

The former group is referred to as "key map data model" here. The line between these and the Key-value Data Model stores is fairly blurry.

The latter, being more about the storage format than about the data model, is listed under Columnar Databases.

You can read more about this distinction on Prof. Daniel Abadi's blog: Distinguishing two major types of Column Stores.
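
The key-map data model described in this note can be sketched with nested dictionaries: a (possibly composite) row key maps to column families, and each column family maps column names to values. This is a minimal in-memory illustration, not the API of any store listed below; the class `KeyMapTable` and its methods are hypothetical names.

```python
from collections import defaultdict


class KeyMapTable:
    """Toy table: row key -> column family -> column -> value."""

    def __init__(self, families):
        # Column families are fixed up front; columns inside them are free-form.
        self.families = set(families)
        self.rows = defaultdict(lambda: {family: {} for family in self.families})

    def put(self, row_key, family, column, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][family][column] = value

    def get(self, row_key, family, column, default=None):
        # Check membership first so a read never creates an empty row.
        if row_key not in self.rows:
            return default
        return self.rows[row_key].get(family, {}).get(column, default)

    def get_family(self, row_key, family):
        """Return a copy of one value map (column family) for a row key."""
        if row_key not in self.rows:
            return {}
        return dict(self.rows[row_key].get(family, {}))


if __name__ == "__main__":
    table = KeyMapTable(["profile", "activity"])
    # A composite key is modelled here as a tuple.
    table.put(("user", 42), "profile", "name", "Ada")
    table.put(("user", 42), "activity", "last_login", "2015-01-01")
    print(table.get_family(("user", 42), "profile"))
```

Different rows may hold different columns inside the same family, which is what separates this model from a fixed relational schema.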

  • Apache Accumulo - distributed key/value store, built on Hadoop.
  • Apache Cassandra - column-oriented distributed datastore, inspired by BigTable.
  • Apache HBase - column-oriented distributed datastore, inspired by BigTable.
  • Facebook HydraBase - evolution of HBase made by Facebook.
  • Google BigTable - column-oriented distributed datastore.
  • Google Cloud Datastore - is a fully managed, schemaless database for storing non-relational data over BigTable.
  • Hypertable - column-oriented distributed datastore, inspired by BigTable.
  • InfiniDB - is accessed through a MySQL interface and uses massive parallel processing to parallelize queries.
  • Tephra - Transactions for HBase.
  • Twitter Manhattan - real-time, multi-tenant distributed database for Twitter scale.

Key-value Data Model

  • Aerospike - NoSQL flash-optimized, in-memory. Open source and "Server code in 'C' (not Java or Erlang) precisely tuned to avoid context switching and memory copies."
  • Amazon DynamoDB - distributed key/value store, implementation of Dynamo paper.
  • Edis - is a protocol-compatible Server replacement for Redis.
  • ElephantDB - Distributed database specialized in exporting data from Hadoop.
  • EventStore - distributed time series database.
  • LinkedIn Krati - is a simple persistent data store with very low latency and high throughput.
  • Linkedin Voldemort - distributed key/value storage system.
  • Oracle NoSQL Database - distributed key-value database by Oracle Corporation.
  • Redis - in memory key value datastore.
  • Riak - a decentralized datastore.
  • Storehaus - library to work with asynchronous key value stores, by Twitter.
  • Tarantool - an efficient NoSQL database and a Lua application server.
  • TreodeDB - key-value store that's replicated and sharded and provides atomic multirow writes.

Graph Data Model

  • Apache Giraph - implementation of Pregel, based on Hadoop.
  • Apache Spark Bagel - implementation of Pregel, part of Spark.
  • ArangoDB - multi model distributed database.
  • Facebook TAO - distributed data store that is widely used at Facebook to store and serve the social graph.
  • Google Cayley - open-source graph database.
  • Google Pregel - graph processing framework.
  • GraphLab PowerGraph - a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API.
  • GraphX - resilient Distributed Graph System on Spark.
  • Gremlin - graph traversal Language.
  • Infovore - RDF-centric Map/Reduce framework.
  • Intel GraphBuilder - tools to construct large-scale graphs on top of Hadoop.
  • MapGraph - Massively Parallel Graph processing on GPUs.
  • Neo4j - graph database written entirely in Java.
  • OrientDB - document and graph database.
  • Phoebus - framework for large scale graph processing.
  • Titan - distributed graph database, built over Cassandra.
  • Twitter FlockDB - distributed graph database.

Columnar Databases

Note: please read the note in the Key Map Data Model section.
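
The storage-layout distinction drawn in that note (row by row versus column by column) can be illustrated with plain Python structures. This is only a sketch of the trade-off, assuming every row has the same fields; the sample data and the helper names `to_columns` and `fetch_row` are made up for the example.

```python
# Row-oriented layout: all column values for one key sit next to each other.
rows = [
    {"id": 1, "country": "DE", "amount": 10.0},
    {"id": 2, "country": "US", "amount": 25.5},
    {"id": 3, "country": "DE", "amount": 4.5},
]


def to_columns(rows):
    """Pivot row dicts into a column-oriented layout: one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def fetch_row(columns, index):
    """Rebuild a single row; this has to touch every column."""
    return {name: values[index] for name, values in columns.items()}


columns = to_columns(rows)

# Aggregating one column reads a single contiguous list.
total_amount = sum(columns["amount"])

# Fetching one full record reads from every column list.
second_row = fetch_row(columns, 1)

if __name__ == "__main__":
    print(total_amount)
    print(second_row)
```

The aggregation touches one list while the row lookup touches all of them, which mirrors why columnar stores favor analytic scans over single-record access.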

  • Columnar Storage - an explanation of what columnar storage is and when you might want it.
  • Actian Vector - column-oriented analytic database.
  • C-Store - column oriented DBMS.
  • MonetDB - column store database.
  • Parquet - columnar storage format for Hadoop.
  • Pivotal Greenplum - purpose-built, dedicated analytic data warehouse that offers a columnar engine as well as a traditional row-based one.
  • Vertica - is designed to manage large, fast-growing volumes of data and provide very fast query performance when used for data warehouses.
  • Google BigQuery - Google's cloud offering backed by their pioneering work on Dremel.
  • Amazon Redshift - Amazon's cloud offering, also based on a columnar datastore backend.

NewSQL Databases

  • Actian Ingres - commercially supported, open-source SQL relational database management system.
  • Amazon RedShift - data warehouse service, based on PostgreSQL.
  • BayesDB - statistic oriented SQL database.
  • CitusDB - scales out PostgreSQL through sharding and replication.
  • Cockroach - Scalable, Geo-Replicated, Transactional Datastore.
  • Datomic - distributed database designed to enable scalable, flexible and intelligent applications.
  • FoundationDB - distributed database, inspired by F1.
  • Google F1 - distributed SQL database built on Spanner.
  • Google Spanner - globally distributed semi-relational database.
  • H-Store - is an experimental main-memory, parallel database management system that is optimized for on-line transaction processing (OLTP) applications.
  • Haeinsa - linearly scalable multi-row, multi-table transaction library for HBase based on Percolator.
  • HandlerSocket - NoSQL plugin for MySQL/MariaDB.
  • InfiniSQL - infinitely scalable RDBMS.
  • MemSQL - in-memory SQL database with optimized columnar storage on flash.
  • NuoDB - SQL/ACID compliant distributed database.
  • Oracle Database - object-relational database management system.
  • Oracle TimesTen in-Memory Database - in-memory, relational database management system with persistence and recoverability.
  • Pivotal GemFire XD - Low-latency, in-memory, distributed SQL data store. Provides SQL interface to in-memory table data, persistable in HDFS.
  • SAP HANA - is an in-memory, column-oriented, relational database management system.
  • SenseiDB - distributed, realtime, semi-structured database.
  • Sky - database used for flexible, high performance analysis of behavioral data.
  • SymmetricDS - open source software for both file and database synchronization.
  • Map-D - GPU in-memory database, big data analysis and visualization platform
  • TiDB - TiDB is a distributed SQL database. Inspired by the design of Google F1.
  • VoltDB - claims to be fastest in-memory database

Time-Series Databases

  • Cube - uses MongoDB to store time series data.
  • Axibase Time Series Database - distributed time series database on top of HBase. Includes built-in Rule Engine, data forecasting and visualization.
  • InfluxDB - distributed time series database.
  • Kairosdb - similar to OpenTSDB but allows for Cassandra.
  • OpenTSDB - distributed time series database on top of HBase.
  • Prometheus - a time series database and service monitoring system
  • Newts - a time series database based on Apache Cassandra

SQL-like processing

  • Actian SQL for Hadoop - high performance interactive SQL access to all Hadoop data.
  • AMPLAB Shark - data warehouse system for Spark.
  • Apache Drill - framework for interactive analysis, inspired by Dremel.
  • Apache HCatalog - table and storage management layer for Hadoop.
  • Apache Hive - SQL-like data warehouse system for Hadoop.
  • Apache Optiq - framework that allows efficient translation of queries involving heterogeneous and federated data.
  • Apache Phoenix - SQL skin over HBase.
  • BlinkDB - massively parallel, approximate query engine.
  • Cloudera Impala - framework for interactive analysis, Inspired by Dremel.
  • Concurrent Lingual - SQL-like query language for Cascading.
  • Datasalt Splout SQL - full SQL query engine for big datasets.
  • Facebook PrestoDB - distributed SQL query engine.
  • Google BigQuery - framework for interactive analysis, implementation of Dremel.
  • Pivotal HAWQ - SQL-like data warehouse system for Hadoop.
  • RainstorDB - database for storing petabyte-scale volumes of structured and semi-structured data.
  • Spark Catalyst - is a Query Optimization Framework for Spark and Shark.
  • SparkSQL - Manipulating Structured Data Using Spark.
  • Splice Machine - a full-featured SQL-on-Hadoop RDBMS with ACID transactions.
  • Stinger - interactive query for Hive.
  • Tajo - distributed data warehouse system on Hadoop.
  • Trafodion - enterprise-class SQL-on-HBase solution targeting big data transactional or operational workloads.

Data Ingestion

  • Amazon Kinesis - real-time processing of streaming data at massive scale.
  • Apache Chukwa - data collection system.
  • Apache Flume - service to manage large amount of log data.
  • Apache Kafka - distributed publish-subscribe messaging system.
  • Apache Sqoop - tool to transfer data between Hadoop and a structured datastore.
  • Cloudera Morphlines - framework that helps with ETL to Solr, HBase and HDFS.
  • Facebook Scribe - streamed log data aggregator.
  • Fluentd - tool to collect events and logs.
  • Google Photon - geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency.
  • Heka - open source stream processing software system.
  • HIHO - framework for connecting disparate data sources with Hadoop.
  • Kestrel - distributed message queue system.
  • LinkedIn Databus - stream of change capture events for a database.
  • LinkedIn Kamikaze - utility package for compressing sorted integer arrays.
  • LinkedIn White Elephant - log aggregator and dashboard.
  • Logstash - a tool for managing events and logs.
  • Netflix Suro - log aggregator like Storm and Samza, based on Chukwa.
  • Pinterest Secor - is a service implementing Kafka log persistence.
  • Linkedin Gobblin - LinkedIn's universal data ingestion framework.
  • StreamSets Data Collector - continuous big data ingest infrastructure with a simple to use IDE.

Service Programming

  • Akka Toolkit - runtime for distributed, and fault tolerant event-driven applications on the JVM.
  • Apache Avro - data serialization system.
  • Apache Curator - Java libraries for Apache ZooKeeper.
  • Apache Karaf - OSGi runtime that runs on top of any OSGi framework.
  • Apache Thrift - framework to build binary protocols.
  • Apache Zookeeper - centralized service for process management.
  • Google Chubby - a lock service for loosely-coupled distributed systems.
  • Linkedin Norbert - cluster manager.
  • OpenMPI - message passing framework.
  • Serf - decentralized solution for service discovery and orchestration.
  • Spotify Luigi - a Python package for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.
  • Spring XD - distributed and extensible system for data ingestion, real time analytics, batch processing, and data export.
  • Twitter Elephant Bird - libraries for working with LZOP-compressed data.
  • Twitter Finagle - asynchronous network stack for the JVM.


Machine Learning

  • Apache Mahout - machine learning library for Hadoop.
  • brain - Neural networks in JavaScript.
  • Cloudera Oryx - real-time large-scale machine learning.
  • Concurrent Pattern - machine learning library for Cascading.
  • convnetjs - Deep Learning in Javascript. Train Convolutional Neural Networks (or ordinary ones) in your browser.
  • Decider - Flexible and Extensible Machine Learning in Ruby.
  • ENCOG - machine learning framework that supports a variety of advanced algorithms, as well as support classes to normalize and process data.
  • etcML - text classification with machine learning.
  • Etsy Conjecture - scalable Machine Learning in Scalding.
  • Google Sibyl - System for Large Scale Machine Learning at Google.
  • GraphLab Create - A machine learning platform in Python with a broad collection of ML toolkits, data engineering, and deployment tools.
  • H2O - statistical, machine learning and math runtime for Hadoop.
  • MLbase - distributed machine learning libraries for the BDAS stack.
  • MLPNeuralNet - Fast multilayer perceptron neural network library for iOS and Mac OS X.
  • MonkeyLearn - Text mining made easy. Extract and classify data from text.
  • nupic - Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.
  • PredictionIO - machine learning server built on Hadoop, Mahout and Cascading.
  • SAMOA - distributed streaming machine learning framework.
  • scikit-learn - machine learning in Python.
  • Spark MLlib - a Spark implementation of some common machine learning (ML) functionality.
  • Vowpal Wabbit - learning system sponsored by Microsoft and Yahoo!.
  • WEKA - suite of machine learning software.



System Deployment


  • Adobe spindle - Next-generation web analytics processing with Scala, Spark, and Parquet.
  • Apache Kiji - framework to collect and analyze data in real-time, based on HBase.
  • Apache Nutch - open source web crawler.
  • Apache OODT - capturing, processing and sharing of data for NASA's scientific archives.
  • Apache Tika - content analysis toolkit.
  • Domino - Run, scale, share, and deploy models — without any infrastructure.
  • Eclipse BIRT - Eclipse-based reporting system.
  • Eventhub - open source event analytics platform.
  • Hermes - asynchronous message broker built on top of Kafka.
  • HIPI Library - API for performing image processing tasks on Hadoop's MapReduce.
  • Hunk - Splunk analytics for Hadoop.
  • Imhotep - Large scale analytics platform by Indeed.
  • MADlib - data-processing library of an RDBMS to analyze data.
  • Kylin - open source Distributed Analytics Engine from eBay.
  • PivotalR - R on Pivotal HD / HAWQ and PostgreSQL.
  • Qubole - auto-scaling Hadoop cluster, built-in data connectors.
  • Sense - Cloud Platform for Data Science and Big Data Analytics.
  • Snowplow - enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres.
  • SparkR - R frontend for Spark.
  • Splunk - analyzer for machine-generated data.
  • Sumo Logic - cloud based analyzer for machine-generated data.
  • Talend - unified open source environment for YARN, Hadoop, HBASE, Hive, HCatalog & Pig.
  • Warp - query by example tool for big data (OS X app)

Search engine and framework

MySQL forks and evolutions

  • Amazon RDS - MySQL databases in Amazon's cloud.
  • Drizzle - evolution of MySQL 6.0.
  • Google Cloud SQL - MySQL databases in Google's cloud.
  • MariaDB - enhanced, drop-in replacement for MySQL.
  • MySQL Cluster - MySQL implementation using NDB Cluster storage engine.
  • Percona Server - enhanced, drop-in replacement for MySQL.
  • ProxySQL - High Performance Proxy for MySQL.
  • TokuDB - TokuDB is a storage engine for MySQL and MariaDB.
  • WebScaleSQL - is a collaboration among engineers from several companies that face similar challenges in running MySQL at scale.

PostgreSQL forks and evolutions

  • HadoopDB - hybrid of MapReduce and DBMS.
  • IBM Netezza - high-performance data warehouse appliances.
  • Postgres-XL - Scalable Open Source PostgreSQL-based Database Cluster.
  • RecDB - Open Source Recommendation Engine Built Entirely Inside PostgreSQL.
  • Stado - open source MPP database system solely targeted at data warehousing and data mart applications.
  • Yahoo Everest - multi-peta-byte database / MPP derived by PostgreSQL.

Memcached forks and evolutions

Embedded Databases

  • Actian PSQL - ACID-compliant DBMS developed by Pervasive Software, optimized for embedding in applications.
  • BerkeleyDB - a software library that provides a high-performance embedded database for key/value data.
  • HanoiDB - Erlang LSM BTree Storage.
  • LevelDB - a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.
  • LMDB - ultra-fast, ultra-compact key-value embedded data store developed by Symas.
  • RocksDB - embeddable persistent key-value store for fast storage based on LevelDB.

Business Intelligence

  • BIME Analytics - business intelligence platform in the cloud.
  • Chartio - lean business intelligence platform to visualize and explore your data.
  • datapine - self-service business intelligence tool in the cloud.
  • Jaspersoft - powerful business intelligence suite.
  • Jedox Palo - customisable Business Intelligence platform.
  • Microsoft - business intelligence software and platform.
  • Microstrategy - software platforms for business intelligence, mobile intelligence, and network applications.
  • Pentaho - business intelligence platform.
  • Qlik - business intelligence and analytics platform.
  • Saiku - open source analytics platform.
  • SpagoBI - open source business intelligence platform.
  • Tableau - business intelligence platform.
  • Zoomdata - Big Data Analytics.
  • Jethrodata - Interactive Big Data Analytics.

Data Visualization

  • Airpal - Web UI for PrestoDB.
  • Arbor - graph visualization library using web workers and jQuery.
  • Banana - visualize logs and time-stamped data stored in Solr. Port of Kibana.
  • Bokeh - A powerful Python interactive visualization library that targets modern web browsers for presentation, with the goal of providing elegant, concise construction of novel graphics in the style of D3.js, but also delivering this capability with high-performance interactivity over very large or streaming datasets.
  • C3 - D3-based reusable chart library
  • CartoDB - open-source or freemium hosting for geospatial databases with powerful front-end editing capabilities and a robust API.
  • Chart.js - open source HTML5 Charts visualizations.
  • Chartist.js - another open source HTML5 Charts visualization.
  • Crossfilter - JavaScript library for exploring large multivariate datasets in the browser. Works well with dc.js and d3.js.
  • Cubism - JavaScript library for time series visualization.
  • Cytoscape - JavaScript library for visualizing complex networks.
  • DC.js - Dimensional charting built to work natively with crossfilter rendered using d3.js. Excellent for connecting charts/additional metadata to hover events in D3.
  • D3 - JavaScript library for manipulating documents.
  • D3Plus - A fairly robust set of reusable charts and styles for d3.js.
  • Echarts - Baidu's enterprise charts.
  • Envisionjs - dynamic HTML5 visualization.
  • FnordMetric - write SQL queries that return SVG charts rather than tables
  • Freeboard - open source real-time dashboard builder for IoT and other web mashups.
  • Gephi - An award-winning open-source platform for visualizing and manipulating large graphs and network connections. It's like Photoshop, but for graphs. Available for Windows and Mac OS X.
  • Google Charts - simple charting API.
  • Grafana - graphite dashboard frontend, editor and graph composer.
  • Graphite - scalable Realtime Graphing.
  • Highcharts - simple and flexible charting API.
  • IPython - provides a rich architecture for interactive computing.
  • Kibana - visualize logs and time-stamped data
  • Matplotlib - plotting with Python.
  • Metricsgraphic.js - a library built on top of D3 that is optimized for time-series data
  • NVD3 - chart components for d3.js.
  • Peity - Progressive SVG bar, line and pie charts.
  • Plotly - Easy-to-use web service that allows for rapid creation of complex charts, from heatmaps to histograms. Upload data to create and style charts with Plotly's online spreadsheet. Fork others' plots.
  • Recline - simple but powerful library for building data applications in pure Javascript and HTML.
  • Redash - open-source platform to query and visualize data.
  • Sigma.js - JavaScript library dedicated to graph drawing.
  • Vega - a visualization grammar.
  • Zeppelin - a notebook-style collaborative data analysis.
  • Zing Charts - JavaScript charting library for big data.

Internet of things and sensor data

  • TempoIQ - Cloud-based sensor analytics.
  • 2lemetry - Platform for Internet of things.
  • Pubnub - Data stream network
  • ThingWorx - Rapid development and connection of intelligent systems
  • IFTTT - If this then that
  • Evrything - Making products smart

Interesting Readings

Interesting Papers

2013 - 2014

  • 2014 - Stanford - Mining of Massive Datasets.
  • 2013 - AMPLab - Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices.
  • 2013 - AMPLab - MLbase: A Distributed Machine-learning System.
  • 2013 - AMPLab - Shark: SQL and Rich Analytics at Scale.
  • 2013 - AMPLab - GraphX: A Resilient Distributed Graph System on Spark.
  • 2013 - Google - HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm.
  • 2013 - Microsoft - Scalable Progressive Analytics on Big Data in the Cloud.
  • 2013 - Metamarkets - Druid: A Real-time Analytical Data Store.
  • 2013 - Google - Online, Asynchronous Schema Change in F1.
  • 2013 - Google - F1: A Distributed SQL Database That Scales.
  • 2013 - Google - MillWheel: Fault-Tolerant Stream Processing at Internet Scale.
  • 2013 - Facebook - Scuba: Diving into Data at Facebook.
  • 2013 - Facebook - Unicorn: A System for Searching the Social Graph.
  • 2013 - Facebook - Scaling Memcache at Facebook.

2011 - 2012

  • 2012 - Twitter - The Unified Logging Infrastructure for Data Analytics at Twitter.
  • 2012 - AMPLab - Blink and It’s Done: Interactive Queries on Very Large Data.
  • 2012 - AMPLab - Fast and Interactive Analytics over Hadoop Data with Spark.
  • 2012 - AMPLab - Shark: Fast Data Analysis Using Coarse-grained Distributed Memory.
  • 2012 - Microsoft - Paxos Replicated State Machines as the Basis of a High-Performance Data Store.
  • 2012 - Microsoft - Paxos Made Parallel.
  • 2012 - AMPLab - BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data.
  • 2012 - Google - Processing a trillion cells per mouse click.
  • 2012 - Google - Spanner: Google’s Globally-Distributed Database.
  • 2011 - AMPLab - Scarlett: Coping with Skewed Popularity Content in MapReduce Clusters.
  • 2011 - AMPLab - Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center.
  • 2011 - Google - Megastore: Providing Scalable, Highly Available Storage for Interactive Services.

2001 - 2010

  • 2010 - Facebook - Finding a needle in Haystack: Facebook’s photo storage.
  • 2010 - AMPLab - Spark: Cluster Computing with Working Sets.
  • 2010 - Google - Storage Architecture and Challenges.
  • 2010 - Google - Pregel: A System for Large-Scale Graph Processing.
  • 2010 - Google - Large-scale Incremental Processing Using Distributed Transactions and Notifications (the basis of Percolator and Caffeine).
  • 2010 - Google - Dremel: Interactive Analysis of Web-Scale Datasets.
  • 2010 - Yahoo - S4: Distributed Stream Computing Platform.
  • 2009 - HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads.
  • 2008 - AMPLab - Chukwa: A large-scale monitoring system.
  • 2007 - Amazon - Dynamo: Amazon’s Highly Available Key-value Store.
  • 2006 - Google - The Chubby lock service for loosely-coupled distributed systems.
  • 2006 - Google - Bigtable: A Distributed Storage System for Structured Data.
  • 2004 - Google - MapReduce: Simplified Data Processing on Large Clusters.
  • 2003 - Google - The Google File System.


Data Visualization

Other Awesome Lists

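Copyright notice: This is an original article by the blogger, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.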

