Saturday, December 22, 2018
Tuesday, November 27, 2018
Types of NoSQL
There are 4 basic types of NoSQL databases:
Key-Value Store
- A large hash table of keys and values
e.g. MemcacheDB, Redis, Amazon Dynamo
A key-value database such as Redis is schema-less: it imposes no schema on the stored values and looks them up by key. The key can be synthetic or auto-generated, while the value can be a String, JSON, a BLOB (binary large object), etc.
CAP: Many key-value stores (Amazon Dynamo is the usual example) favor Availability and Partition tolerance at the expense of strong Consistency.
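The "big hash table" idea can be sketched in a few lines of Python. This is only a toy in-memory illustration; `KeyValueStore`, `put`, and `get` are made-up names, not the API of Redis or any other product.

```python
import json


class KeyValueStore:
    """Toy in-memory key-value store: values are bytes looked up by key only."""

    def __init__(self):
        self._table = {}  # the "big hash table"

    def put(self, key, value: bytes):
        self._table[key] = value

    def get(self, key, default=None):
        return self._table.get(key, default)


store = KeyValueStore()
# JSON is serialized to bytes before it is stored.
store.put("user:42", json.dumps({"name": "Ada"}).encode("utf-8"))
# A BLOB is stored in exactly the same way; the store does not care what is inside.
store.put("avatar:42", b"\x89PNG...")

raw = store.get("user:42")
```

The store never inspects the value, so any query other than "give me the value for this key" has to be done by the application.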
Document-based Store
- It stores documents made up of tagged elements
e.g. MongoDB, CouchDB
A document store is similar to a key-value store: the data is still a collection of key-value pairs. The difference is that the stored values (referred to as “documents”) carry structure and an encoding that the database understands. XML, JSON (JavaScript Object Notation), and BSON (a binary encoding of JSON objects) are common encodings.
Because a document embeds attribute metadata alongside the stored content, a document store can query the data based on its contents, not only by key.
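Querying by content can be sketched as follows. `DocumentStore` and `find` are hypothetical names for illustration and are much simpler than the query languages of MongoDB or CouchDB.

```python
class DocumentStore:
    """Toy document store: documents are dicts, so the store can look inside them."""

    def __init__(self):
        self._docs = {}

    def insert(self, doc_id, document):
        self._docs[doc_id] = document

    def find(self, **criteria):
        """Return the ids of documents whose fields match every criterion."""
        return [
            doc_id
            for doc_id, doc in self._docs.items()
            if all(doc.get(field) == value for field, value in criteria.items())
        ]


store = DocumentStore()
# Documents in the same store do not need to share the same fields.
store.insert("a1", {"title": "Types of NoSQL", "author": "Ada", "tags": ["nosql"]})
store.insert("a2", {"title": "Intro to Mesos", "author": "Bob"})
store.insert("a3", {"title": "Intro to HBase", "author": "Ada", "year": 2018})

by_ada = store.find(author="Ada")
by_ada_2018 = store.find(author="Ada", year=2018)
```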
Column-based Store
- Each storage block contains data from only one column
e.g. HBase, Cassandra
In a column-oriented NoSQL database, data is stored in cells grouped by column rather than by row. Columns are logically grouped into column families. A column family can contain a virtually unlimited number of columns, which can be created at runtime or declared in the schema definition. Reads and writes are done by column rather than by row.
In comparison, most relational DBMSs store data in rows. A relational database stores a single row as one contiguous disk entry, and different rows are stored in different places on disk. A columnar database stores all the cells of a column as one contiguous disk entry, which makes search/access and aggregation over that column faster.
For example, querying the titles of a million articles is a painstaking task in a row-oriented relational database, because it has to visit the location of every row to pick out the title. In a columnar database the titles are stored together, so all of them can be obtained with far fewer disk accesses (a single sequential read in the ideal case).
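The difference between the two layouts can be illustrated with plain Python lists. This is a sketch of the idea only: real engines work with disk blocks, and the sample rows are invented.

```python
# Row layout: each record is stored together.
rows = [
    {"id": 1, "title": "Types of NoSQL", "views": 120},
    {"id": 2, "title": "Intro to Mesos", "views": 80},
    {"id": 3, "title": "Intro to HBase", "views": 200},
]


def to_columns(rows):
    """Pivot row-oriented records into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every record has to be visited to collect one attribute.
titles_row_layout = [row["title"] for row in rows]
# Column layout: the attribute is already one contiguous sequence.
titles_column_layout = columns["title"]
# Aggregation only touches the column it needs.
total_views = sum(columns["views"])
```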
Data Model
ColumnFamily: a single structure that groups Columns and SuperColumns.
Key: the permanent name of the record. Different keys can have different numbers of columns, so the database can scale in an irregular way.
Keyspace: the outermost level of organization, typically named after the application. It is roughly comparable to a schema in an RDBMS.
Column: an ordered list of elements (a tuple) with a name and a value defined.
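One way to picture this data model is as nested maps: keyspace, then column family, then row key, then columns. The sketch below uses invented names (`blog_keyspace`, `get_column`, `add_column`) and leaves out SuperColumns.

```python
# keyspace -> column family -> row key -> {column name: column value}
blog_keyspace = {
    "Articles": {
        "article:1": {"title": "Types of NoSQL", "author": "Ada", "tags": "nosql"},
        "article:2": {"title": "Intro to Mesos"},
    },
}


def get_column(keyspace, column_family, key, column_name):
    """Look up one column value; a missing column is simply absent."""
    return keyspace[column_family][key].get(column_name)


def add_column(keyspace, column_family, key, column_name, value):
    """Columns can be added to a single row at runtime."""
    keyspace[column_family].setdefault(key, {})[column_name] = value


add_column(blog_keyspace, "Articles", "article:2", "views", 80)

# Each key ends up with its own number of columns.
column_counts = {key: len(cols) for key, cols in blog_keyspace["Articles"].items()}
```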
Graph-based
- A network database that uses edges and nodes to represent and store data.
e.g. Neo4j
The nodes are organized by their relationships with one another, which are represented by edges between the nodes.
Both the nodes and the relationships have some defined properties.
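A property graph can be sketched with a dict of nodes and a list of edges, each carrying its own properties. The data and the helper names `neighbors` and `fans_of` are invented for illustration; this is not how Neo4j stores or queries data.

```python
# Nodes: id -> properties
nodes = {
    "alice": {"label": "Person", "age": 30},
    "bob": {"label": "Person", "age": 25},
    "hbase": {"label": "Topic"},
}

# Edges: (source, relationship, target, properties)
edges = [
    ("alice", "KNOWS", "bob", {"since": 2015}),
    ("alice", "LIKES", "hbase", {"rating": 5}),
    ("bob", "LIKES", "hbase", {"rating": 3}),
]


def neighbors(node_id, relationship):
    """Follow outgoing edges of one relationship type from a node."""
    return [dst for src, rel, dst, _props in edges if src == node_id and rel == relationship]


def fans_of(topic, min_rating):
    """Filter on a property of the relationship itself."""
    return [
        src
        for src, rel, dst, props in edges
        if rel == "LIKES" and dst == topic and props["rating"] >= min_rating
    ]
```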
Monday, November 26, 2018
Wednesday, November 14, 2018
Introduction to Mesos Architecture
Mesos consists of:
1) A master daemon that manages the agent daemons running on each cluster node.
- The master contains an allocation policy module.
- The master enables fine-grained sharing of resources (CPU, RAM, …) across frameworks by making them resource offers.
- Each resource offer contains a list of <agent ID, resource1: amount1, resource2: amount2, ...>
2) Mesos frameworks that run tasks on these agents.
A Mesos framework consists of:
- a scheduler that registers with the master to be offered resources
- an executor process that is launched on agent nodes to run the framework’s tasks
1. A Mesos agent (called a slave in older Mesos releases) reports its available resources to the Mesos master.
2. Based on the allocation policy module, the Mesos master decides which framework to offer these resources to. For example, it offers them to Framework 1.
3. Framework 1 is free to accept or decline the offered resources. For example, it accepts the offer and tells the master which tasks to run on them.
4. The master sends the tasks to the agent, and the Framework 1 executor takes over. The Mesos master may offer any unused resources to other frameworks.
Two-Level Scheduling
The allocation module decides which resources are offered to each framework.
The framework scheduler decides which resources are used for each task.
How can the constraints of a framework be satisfied without Mesos knowing about these constraints? For example, how can a framework achieve data locality without Mesos knowing which nodes store the data required by the framework? Mesos answers these questions by simply giving frameworks the ability to reject offers.
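The offer/decline mechanism can be sketched as below. All names (`Offer`, `LocalityScheduler`, `master_offer_round`) are hypothetical and are not the Mesos API; the point is only that the locality constraint lives entirely inside the framework.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    agent_id: str
    resources: dict  # e.g. {"cpus": 4, "mem": 8192}


class LocalityScheduler:
    """Hypothetical framework scheduler that only accepts offers from agents holding its data."""

    def __init__(self, data_agents, task_needs):
        self.data_agents = set(data_agents)  # known to the framework, not to the master
        self.task_needs = task_needs

    def resource_offer(self, offer):
        fits = all(offer.resources.get(name, 0) >= amount
                   for name, amount in self.task_needs.items())
        if offer.agent_id in self.data_agents and fits:
            return "accept"
        return "decline"


def master_offer_round(offers, scheduler):
    """Level one: the master picks what to offer. Level two: the framework picks what to use."""
    return {offer.agent_id: scheduler.resource_offer(offer) for offer in offers}


offers = [
    Offer("agent-1", {"cpus": 4, "mem": 8192}),
    Offer("agent-2", {"cpus": 4, "mem": 8192}),
    Offer("agent-3", {"cpus": 1, "mem": 512}),
]
scheduler = LocalityScheduler(data_agents=["agent-2", "agent-3"],
                              task_needs={"cpus": 2, "mem": 1024})
decisions = master_offer_round(offers, scheduler)
```

Here agent-1 is declined because it holds no data for the framework, and agent-3 is declined because the offer is too small, all without the master knowing either rule.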
Scheduling Algorithm
Dominant Resource Fairness Algorithm (DRF)
DRF is a max-min fairness algorithm generalized to heterogeneous resource types, for example:
- CPU
- Memory
- IO
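A minimal sketch of DRF, assuming each user runs identical tasks with a fixed demand vector: repeatedly give one task to the user whose dominant share (its largest share of any single resource) is currently lowest. The function name and the numbers are illustrative. As a simplification, a user whose next task no longer fits is skipped, so the loop ends when nobody fits.

```python
from fractions import Fraction


def drf_allocate(capacity, demands):
    """Hand out tasks one at a time to the user with the lowest dominant share."""
    used = {r: 0 for r in capacity}
    allocated = {u: {r: 0 for r in capacity} for u in demands}
    tasks = {u: 0 for u in demands}

    def dominant_share(user):
        return max(Fraction(allocated[user][r], capacity[r]) for r in capacity)

    active = set(demands)
    while active:
        # sorted() makes tie-breaking deterministic (alphabetical).
        user = min(sorted(active), key=dominant_share)
        demand = demands[user]
        if any(used[r] + demand[r] > capacity[r] for r in capacity):
            active.discard(user)  # this user's next task no longer fits
            continue
        for r in capacity:
            used[r] += demand[r]
            allocated[user][r] += demand[r]
        tasks[user] += 1
    return tasks


capacity = {"cpu": 9, "mem": 18}
demands = {
    "A": {"cpu": 1, "mem": 4},  # memory is A's dominant resource
    "B": {"cpu": 3, "mem": 1},  # CPU is B's dominant resource
}
result = drf_allocate(capacity, demands)
```

With these numbers A ends up with 3 tasks (12/18 of memory) and B with 2 tasks (6/9 of CPU), so both dominant shares are 2/3.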
Similar Systems
YARN, Kubernetes, Docker Swarm
Friday, November 9, 2018
Sunday, November 4, 2018
Introduction to HBase
What is HBase
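HBase is a column-oriented database management system (NoSQL) that runs on top of HDFS.
- High consistency
- Fast scan operations
- Comparatively less available (e.g. during a Major Compaction)
- Suited to storing massive amounts of data (if the data set is small, HBase performs much worse)
- Not suited to OLTP-style transactional logic that needs fast responses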
Architecture Design Strategy
HBase Structure
HBase uses a master-slave model with 3 kinds of roles.
- HMaster: A cluster has one active HMaster.
The HMaster runs on the NameNode. It monitors the state of all Region Servers, handles all metadata changes (create/delete/update table), assigns regions, and monitors the load on each Region Server. If a Region Server is under heavy load, the HMaster splits its regions and assigns them to other Region Servers, which achieves load balancing.
The HMaster is a lightweight master because it does not handle the data itself, only metadata. A cluster has only one active HMaster, but it can have several backup HMasters, which provides redundancy.
- Region Server: Handles all I/O requests. A cluster has multiple Region Servers.
A Region Server runs on a Hadoop DataNode and stores the actual data, which gives very high data locality. It handles all I/O requests, and one Region Server can serve around 1000 regions.
Region
A region is the product of horizontally partitioning a table, 1 GB in size by default. A region covers the range between a start key and an end key and can grow to at most 1 GB. The HMaster assigns each region to one Region Server, and one Region Server hosts many regions. All reads and writes are handled directly by the Region Server.
- Zookeeper: Coordinates the whole cluster.
Zookeeper stores the location of META (a B-tree). META holds the location of every region.
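How a client uses Zookeeper and META to find the right Region Server can be sketched as a two-hop lookup with a client-side cache. Everything here (`Client`, `locate`, the dict standing in for Zookeeper, the key ranges) is a made-up simplification, not the HBase client API.

```python
class Client:
    """Toy client: Zookeeper -> META -> Region Server, with a local cache."""

    def __init__(self, zookeeper, region_servers):
        self.zookeeper = zookeeper            # {"meta_location": server name}
        self.region_servers = region_servers  # server name -> hosted tables
        self.cache = []                       # cached (start_key, end_key, server)
        self.zookeeper_lookups = 0

    def locate(self, row_key):
        # Later operations are answered from the client's own cache.
        for start, end, server in self.cache:
            if start <= row_key < end:
                return server
        # First hop: ask Zookeeper where META lives.
        self.zookeeper_lookups += 1
        meta_server = self.zookeeper["meta_location"]
        # Second hop: ask META which Region Server owns the key range.
        for start, end, server in self.region_servers[meta_server]["meta"]:
            if start <= row_key < end:
                self.cache.append((start, end, server))
                return server
        raise KeyError(row_key)


# Each META entry is (start key, end key, Region Server); "{" sorts just after "z".
META = [("a", "m", "rs-1"), ("m", "{", "rs-2")]
client = Client({"meta_location": "rs-1"}, {"rs-1": {"meta": META}, "rs-2": {}})

first = client.locate("row-42")   # goes through Zookeeper and META
second = client.locate("rabbit")  # same region, served from the cache
```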
Read/Write
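When a client reads or writes for the first time:
1. It first contacts Zookeeper to get the location of META.
2. It asks the Region Server that hosts the META table for the location of the Region Server that actually serves the target region. This information is also put into the client's cache.
3. It goes to that Region Server to read or write.
For later operations the client does not go back to Zookeeper; it uses its own cache.
Region Server Components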
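WAL: Write-ahead log, located on HDFS. To speed up writes, a write counts as successful once the client's operation has been written to the WAL. The WAL can be replayed (WAL recovery) to restore Memstore data that was lost.
BlockCache: The read cache. It stores the most recently read data; nowadays it generally uses an LRU algorithm.
Memstore: The write cache. It stores the most recently written data.
HFile: When the Memstore is full, its data is written to an HFile in one go (sequential write).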
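The write path through these components can be sketched as follows. `RegionServerWriter` and `memstore_limit` are hypothetical names, Python lists stand in for HDFS files, and clearing the WAL on flush is a simplification of how real WAL cleanup works.

```python
class RegionServerWriter:
    """Toy write path: WAL first, then Memstore, then a sorted flush to an HFile."""

    def __init__(self, memstore_limit=3):
        self.wal = []        # stand-in for the write-ahead log on HDFS
        self.memstore = {}   # write cache, held in memory
        self.hfiles = []     # each HFile is an immutable sorted list of (key, value)
        self.memstore_limit = memstore_limit

    def put(self, key, value):
        self.wal.append((key, value))  # 1. log the edit so it survives a crash
        self.memstore[key] = value     # 2. apply it to the write cache
        if len(self.memstore) >= self.memstore_limit:
            self.flush()

    def flush(self):
        # Sorting first lets the whole Memstore be written as one sequential pass.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}
        self.wal = []  # simplification: flushed edits no longer need replaying

    def crash_and_recover(self):
        self.memstore = {}            # in-memory data is lost
        for key, value in self.wal:   # WAL recovery: replay unflushed edits
            self.memstore[key] = value


server = RegionServerWriter(memstore_limit=3)
for key, value in [("c", 3), ("a", 1), ("b", 2), ("d", 4)]:
    server.put(key, value)
server.crash_and_recover()
```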
Read:
1. Look in the BlockCache first.
2. Then look in the MemStore.
3. Finally look in the HFiles.
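The lookup order above can be sketched like this. `RegionServerReader` is a hypothetical class, dicts stand in for HFiles, and the sketch follows the three steps literally; it ignores the case where a newer MemStore value should shadow a cached one.

```python
from collections import OrderedDict


class RegionServerReader:
    """Toy read path: BlockCache, then MemStore, then HFiles (newest first)."""

    def __init__(self, memstore, hfiles, cache_size=2):
        self.block_cache = OrderedDict()  # LRU order: least recently used first
        self.memstore = memstore
        self.hfiles = hfiles              # list of dicts, oldest first
        self.cache_size = cache_size

    def get(self, key):
        if key in self.block_cache:
            self.block_cache.move_to_end(key)  # mark as recently used
            return self.block_cache[key], "BlockCache"
        if key in self.memstore:
            return self.memstore[key], "MemStore"
        for hfile in reversed(self.hfiles):    # newest HFile wins
            if key in hfile:
                self._cache(key, hfile[key])
                return hfile[key], "HFile"
        return None, "miss"

    def _cache(self, key, value):
        self.block_cache[key] = value
        self.block_cache.move_to_end(key)
        if len(self.block_cache) > self.cache_size:
            self.block_cache.popitem(last=False)  # evict the least recently used entry


reader = RegionServerReader(memstore={"d": 4}, hfiles=[{"a": 1, "b": 2}, {"a": 10}])
r1 = reader.get("d")
r2 = reader.get("a")  # found in the newest HFile, then cached
r3 = reader.get("a")  # now served from the BlockCache
r4 = reader.get("zzz")
```

Every extra HFile is one more place the last step may have to search.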
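Optimization: reduce the number of HFiles
- Minor Compaction merges several HFiles into a slightly larger HFile.
- Major Compaction merges multiple HFiles into one very large HFile. It needs a lot of I/O, and the Region Server is unavailable while it runs.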
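Both kinds of compaction boil down to merging sorted files and keeping the newest value for each key. A minimal sketch, with `compact` as a made-up helper and lists of pairs standing in for HFiles:

```python
def compact(hfiles):
    """Merge HFiles (oldest first) into one sorted HFile, keeping the newest value per key."""
    merged = {}
    for hfile in hfiles:  # later (newer) files overwrite earlier ones
        merged.update(dict(hfile))
    return sorted(merged.items())


# Three HFiles, oldest first.
hfiles = [
    [("a", 1), ("c", 3)],
    [("b", 2), ("c", 30)],
    [("a", 10)],
]

# Minor compaction: merge only some of the files into a slightly larger one.
minor = [compact(hfiles[:2])] + hfiles[2:]
# Major compaction: merge everything into a single file.
major = [compact(hfiles)]
```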
Data Model
An HBase column represents an attribute of an object
Row: one row of data.
Column Family
Column
HBase allows for many attributes to be grouped together into what is known as column families, such that the elements of a column family are all stored together.
With HBase, you must predefine the table schema and specify the column families.
Versioning
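HBase can store multiple versions of a value, which is useful for looking back at earlier data. HBase reads target the latest version of a value by default, so the core data should live in the latest version.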
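Versioning can be pictured as each cell keeping a short list of (timestamp, value) pairs, newest first. `VersionedCell` and its `max_versions` parameter are hypothetical names used only for this sketch.

```python
class VersionedCell:
    """Toy cell that keeps the newest `max_versions` values, newest first."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self._versions = []  # list of (timestamp, value)

    def put(self, timestamp, value):
        self._versions.append((timestamp, value))
        self._versions.sort(key=lambda item: item[0], reverse=True)
        del self._versions[self.max_versions:]  # drop versions beyond the limit

    def get(self, versions=1):
        """By default only the latest version is returned."""
        return self._versions[:versions]


cell = VersionedCell(max_versions=3)
cell.put(100, "draft")
cell.put(200, "review")
cell.put(300, "final")
cell.put(400, "final-v2")

latest = cell.get()
history = cell.get(versions=3)
```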