Friday, October 19, 2018

Hadoop vs. Spark


Hadoop Ecosystem

Apache Hadoop Ecosystem

Oozie: Scheduling
ZooKeeper: Management & Coordination
HBase: NoSQL Database
Hive: Analytical SQL on Hadoop

Hadoop platform
  • HDFS
  • YARN
  • MapReduce (file-based: intermediate results go to disk) <=> Spark (in-memory)
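
As a rough illustration of the MapReduce model named above, here is a minimal single-process sketch in plain Python. It does not use the Hadoop API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for this example. On a real cluster, the output of each phase would be written to files and moved over the network rather than kept in Python lists.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map -> shuffle -> reduce over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())
```

Calling word_count(["a b a", "b c"]) counts each word across both lines. The shuffle step is the part that becomes expensive at scale, since every value for a key has to reach the same reducer.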

Hadoop vs. Spark

(If Spark has to read its input from HDFS, remote network bandwidth becomes the bottleneck, and Spark can sometimes be slower than Hadoop MapReduce. One option is a hybrid of MapReduce and Spark: Hadoop for the shuffle, Spark for the algorithm.)
For streaming and machine learning, Spark beats Hadoop MapReduce (these workloads iterate a lot)
For batch processing, it depends
A Spark cluster is more expensive than a Hadoop cluster
Spark applications are hard to scale and need a Spark expert to tune them for speed
For PB-scale data, MapReduce is stable and good enough

Machine Learning & Algorithm

Linear regression
Logistic regression
Decision Tree
Neural network
Word2Vec

Weighted random sampling
Combination
Page Rank
N-Gram
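
To make the "lots of iteration" point concrete, here is a hedged sketch of PageRank by power iteration in plain Python. It is a toy single-machine version, not a Hadoop or Spark job; the function name pagerank, the damping value 0.85, and the fixed iteration count are assumptions chosen for illustration. Every pass needs the full rank table from the previous pass, which is why keeping that state in memory tends to help for this kind of algorithm.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank.

    links maps each node to the list of nodes it links to.
    Returns a dict of node -> rank; the ranks sum to 1.
    """
    # Collect every node that appears as a source or a target.
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every node starts each pass with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's damped rank evenly over its out-links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node (no out-links): spread its rank over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

For example, pagerank({"a": ["c"], "b": ["c"], "c": ["a"]}) gives "c" a higher rank than "b", because "c" has two incoming links and "b" has none.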
