Hadoop Ecosystem

OOZIE: Scheduling
ZooKeeper: Management & Coordination
HBase: NoSQL Database
HIVE: Analytical SQL on Hadoop
Hadoop platform
- HDFS
- YARN
- MapReduce (file based) <=> Spark
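
To make the MapReduce model in the list above concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases as a word count. It is plain Python and does not use the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample lines are hypothetical, chosen only for illustration. On a real cluster, the intermediate pairs would be written to files between phases, which is what "file based" refers to.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the shuffle step would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse the values for one key into a single count."""
    return key, sum(values)


# Hypothetical input; in Hadoop these lines would come from files on HDFS.
lines = ["hadoop stores data in hdfs", "spark reads data from hdfs"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

Each phase only sees its own input, so the map and reduce calls could be spread across machines without changing the logic.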
Hadoop vs. Spark
(If Spark needs to read from HDFS, remote network bandwidth becomes the bottleneck, and Spark is sometimes slower than Hadoop => can run a hybrid of MapReduce and Spark: Hadoop shuffle, Spark algorithm)
For streaming and machine learning, Spark > Hadoop (lots of iteration)
For batch processing, it depends
A Spark cluster is more expensive than a Hadoop cluster
Spark applications are hard to scale and need a Spark expert to tune for speed
For PB-scale data, MapReduce is stable and good enough
Machine Learning & Algorithm
Linear regression
Logistic regression
Decision Tree
Neural network
Word2Vec
Weighted random sampling
Combination
PageRank
N-Gram
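
As one illustration of why the algorithms listed above tend to involve lots of iteration, here is a rough sketch of PageRank by power iteration on a tiny in-memory graph. It is plain Python, not a Hadoop or Spark job; the `pagerank` function, the damping factor of 0.85, the fixed 50 iterations, and the four-page `graph` are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly.
        dangling = sum(ranks[page] for page in pages if not links.get(page))
        base = (1.0 - damping) / n + damping * dangling / n
        new_ranks = {page: base for page in pages}
        # Every page passes an equal share of its rank to each target.
        for page, targets in links.items():
            if not targets:
                continue
            share = damping * ranks[page] / len(targets)
            for target in targets:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Hypothetical graph: "c" is linked to by every other page.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
```

Every pass re-reads the full set of ranks. A file-based engine would write them out and load them again on each pass, while an in-memory engine can keep them resident between passes.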