Friday, October 19, 2018

Hadoop vs. Spark


Apache Hadoop Ecosystem

OOZIE: Scheduling
ZooKeeper: Management & Coordination
HBase: NoSQL Database
HIVE: Analytical SQL on Hadoop

Hadoop platform
  • HDFS
  • YARN
  • MapReduce (file-based) <=> Spark
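
As a rough illustration of the MapReduce model named in the list above, here is a minimal single-process sketch of the map, shuffle, and reduce phases using word count. It only mimics the data flow: a real Hadoop job distributes these phases across nodes and writes intermediate results to files. The function names are hypothetical, chosen for this example.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word occurrence."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the counts collected for each word."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))
```

Calling `word_count(["a b a", "B c"])` counts words case-insensitively across both lines.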

Hadoop vs. Spark

(If Spark has to read from HDFS on remote nodes, network bandwidth becomes the bottleneck and Spark is sometimes slower than Hadoop MapReduce => a hybrid is possible: Hadoop for the shuffle, Spark for the algorithm)
For streaming and machine learning, Spark > Hadoop (lots of iteration)
For batch processing, it depends
A Spark cluster is more expensive than a Hadoop cluster
Spark applications are hard to scale and need a Spark expert to tune them for speed
For PB-scale data, MapReduce is stable and good enough

Machine Learning & Algorithm

Linear regression
Logistic regression
Decision Tree
Neural network
Word2Vec

Weighted random sampling
Combination
PageRank
N-Gram
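
Two of the items above are small enough to sketch directly. The following is a minimal illustration, not a production implementation: `ngrams` builds sliding-window n-grams from a token list, and `weighted_choice` does weighted random sampling by drawing a point on the cumulative weight line. Both function names and the `rng` parameter are assumptions made for this example.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def weighted_choice(items, weights, rng=random):
    """Pick one item with probability proportional to its weight."""
    cumulative = list(accumulate(weights))
    total = cumulative[-1]
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Draw a point in [0, total) and find the interval it falls into.
    point = rng.random() * total
    return items[bisect_right(cumulative, point)]
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible.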

Friday, October 12, 2018

Notes from a chaotic day before the product release


Early in the morning... ballet practice at home... I can't let the crap at the office get in the way of my workout.

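As soon as I got to the office, the old Jewish guy pulled me aside, saying he had a question he wanted to confirm, and then rambled on and on about his understanding of some piece of code. After listening for ages I still couldn't tell what his question was... Maybe he just wanted to ramble...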

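Right after that came the last meeting before the release, to settle what patching-up could still be done on this final day. During the meeting it turned out that the Russian lady was supposed to have committed an important fix but hadn't (she has long since gone into semi-retirement mode, so why would she be committing code...?). The project manager blew up and said the one thing a PM should never say: "I think this problem is very easy, so why haven't you fixed it?"... All the devs went silent... Yours truly had to step up: once I finish reviewing the intern's code, I'll fix this bug together with the Russian lady. The meeting ended in a fragile peace.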

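Next: code review with the intern, continuing this morning's unfinished discussion with the Jewish guy, and calming the Russian lady down ("How could the PM @#$@$%#... By the way, ZZ, give me a clue about where to look for the bug..."). Three threads were running at once, in my head and on Slack.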

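In the afternoon, a candidate came in for a final onsite interview. Because other people's schedules kept changing, my interview slot kept getting moved too, and in the end it was stretched from 30 minutes to a full hour before the boss finally showed up, late, to take over. After all sorts of awkward small talk I finally got back to my computer. Luckily the intern's final refactor passed, the Russian lady actually found the bug, and the old Jewish guy, who may or may not have gotten an answer out of me, was at least in the middle of making his changes... An hour later, incredibly, everything was done... Submitted the build and checked the final release notes...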

Done. Brain-dead.

Why does our product still have customers? They must be even more brain-dead.

Wednesday, October 10, 2018

Hadoop vs. SQL Comparison


Hadoop: Schema on Read
SQL: Schema on Write
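
The difference between the two can be sketched in a few lines. This is a toy illustration under stated assumptions, not how Hadoop or any SQL engine is implemented: `SCHEMA`, `SchemaOnWriteTable`, and `SchemaOnReadStore` are hypothetical names, and JSON lines stand in for raw files.

```python
import json

# Hypothetical schema for the example: column name -> expected Python type.
SCHEMA = {"user": str, "age": int}


class SchemaOnWriteTable:
    """SQL-style: every row is checked against the schema before it is stored."""

    def __init__(self, schema):
        self.schema = schema
        self.rows = []

    def insert(self, row):
        for column, expected in self.schema.items():
            if not isinstance(row.get(column), expected):
                raise ValueError(f"column {column!r} must be {expected.__name__}")
        self.rows.append(row)


class SchemaOnReadStore:
    """Hadoop-style: raw lines are stored as-is and interpreted only when read."""

    def __init__(self):
        self.raw_lines = []

    def append(self, line):
        self.raw_lines.append(line)  # no validation at write time

    def read(self, schema):
        """Apply the schema at query time, skipping lines that do not fit it."""
        for line in self.raw_lines:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue
            if not isinstance(record, dict):
                continue
            if all(isinstance(record.get(c), t) for c, t in schema.items()):
                yield {c: record[c] for c in schema}
```

In this sketch, bad data is rejected up front by the table, while the store accepts anything and leaves the cost of interpreting it to each reader.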

Hadoop: Compressed files across multiple nodes in a cluster
SQL: relational databases and tables

In the event of a node failure
Hadoop: returns an immediate answer to the user from the data that is available (eventual consistency)
SQL: holds up the entire response to the user until the data is consistent (complete consistency, via two-phase commit)

HIVE:
mimics SQL syntax to run queries on Hadoop

Tuesday, October 9, 2018

Memcached vs. Redis?


Redis is more powerful, more popular, and better supported than Memcached.
For anything new, use Redis.

Memcached: When you restart Memcached your data is gone.
Redis: You can turn off persistence and it will happily lose your data on restart too. If you want your cache to survive restarts it lets you do that as well. In fact, that's the default.

Memcached: limited to strings
Redis: many data types (strings, lists, sets, sorted sets, hashes, ...)

Sunday, October 7, 2018

My Regret...


People always say that “It's better to regret what you have done than what you haven't.”
Oh well, I regret something so wrong that I did 4 years ago and am still doing, and nowadays I burn with regret every second.

My soul is weary with sorrow; strengthen me according to your word.