
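I’m fascinated by the way Facebook operates. It is a very unusual environment that is hard to replicate (the approach would not suit every company, even those that have tried it). The notes below come from conversations with many friends at Facebook about how the company develops, operates, and releases software.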

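Many people seem to be interested in Facebook. The company’s developer-driven culture has come under considerable public scrutiny, and other companies are working out whether and how to adopt a developer-driven culture themselves. Facebook is quite secretive about its internal processes, though. The engineering team does publish public Notes about new features and some internal systems, but these are mostly “what” articles rather than “how” articles, so it is hard for outsiders to see how Facebook manages to innovate and optimize its service more effectively than other companies. As an outsider trying to understand how Facebook operates, I gathered these observations over a period of months. To protect the privacy of my sources, I have removed all names and any mention of specific features and products. I also waited more than six months before publishing these notes, so some of the information is certainly out of date. I hope that publishing them sheds some light on how Facebook has managed to push decision-making down through the organization without descending into chaos. It is hard to argue with Facebook’s results or with the coherence of its product offerings. I think and hope that many consumer internet companies can learn from Facebook’s example.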

*Huge* thanks to the many friends inside Facebook who helped put this article together. Thanks also to people like epriest and fryfrog, who helped with corrections and edits.

Notes:

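  • As of June 2010, Facebook had nearly 2,000 employees, up from roughly 1,100 ten months earlier. Staff nearly doubled in under a year!
  • Engineering and Ops are the two largest teams, with roughly 400-500 people each. Together they make up about half of the company.
  • The ratio of product managers (PMs) to engineers is roughly 1-to-7 to 1-to-10.
  • Every new engineer goes through a 4 to 6 week “Boot Camp”, learning the Facebook system by fixing bugs and attending lectures given by more senior engineers. An estimated 10% of each Boot Camp class does not make it and is counseled out of the organization.
  • After Boot Camp, every engineer gets access to the live database. This comes with a standard lecture (“with great power comes great responsibility”) and a clear list of fire-able offenses, such as sharing private user data.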
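  • [EDIT, thanks fryfrog] “There are also very good safeguards in place to prevent anyone at the company from doing the horrible sorts of things you can imagine people have the power to do being on the inside. If you have to “become” someone who is asking for support, this is logged along with a reason and closely reviewed. Straying here is not tolerated, period.”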
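  • Any engineer can modify any part of Facebook’s code base and check in code at will.
  • Very engineering-driven culture. “Product managers are essentially useless here,” as one engineer put it. Engineers can modify specs mid-process, reorder work projects, and inject new feature ideas at any time. [EDITORIAL] The author of this post is a product manager, so this sentiment really caught my attention. As the rest of these notes show, though, Facebook’s culture has clearly embraced product management practices, so it is not as though the product management role is ignored. Rather, the company’s culture seems to be set so that *everyone* feels responsible for the product.
  • In monthly cross-team meetings, engineers are the ones who present progress reports. Product marketing and product management attend, but if they are particularly outspoken, the feedback to leadership is that “product spoke too much at the last meeting.” The company really wants engineers to own products publicly and be the main point of contact for what they built.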
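  • Resourcing for projects is purely voluntary:
    • A PM lobbies a group of engineers and tries to get them excited about an idea.
    • Engineers decide which features sound interesting to work on.
    • An engineer tells their manager, “I’d like to work on these 5 things this week.”
    • The engineering manager mostly leaves engineers’ preferences alone, but may sometimes ask that certain tasks get done first.
    • Engineers handle the entire feature themselves: front-end JavaScript, back-end database code, and everything in between. If they want help from a designer (there are few dedicated designers), they first need to get a designer interested in the project. The same goes for help from an architect. In general, though, engineers are expected to handle everything themselves.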
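  • Arguments about whether a feature is worth building are usually resolved by spending a week implementing it and then testing it on a small sample of users, e.g., 1% of Nevada users.
  • Engineers generally want to work on infrastructure, scalability, and “hard problems”, because that is where the prestige is. It can be hard to get them excited about front-end projects and user interfaces. This is the opposite of what you find in some consumer businesses, where everyone wants to work on things customers touch so they can point to a particular user experience and say, “I built that.” At Facebook, back-end work such as News Feed algorithms, ad-targeting algorithms, and memcache optimizations is what engineers really want.
  • News Feed is important enough that Zuckerberg personally reviews any change to it, but that is an exceptional case.
  • [CORRECTION, thanks epriest] “There is mandatory code review for all changes (i.e., by one or more engineers). I think the article is just saying that Zuck doesn’t look at every change personally.”
  • [CORRECTION, thanks fryfrog] “All changes are reviewed by at least one person, and the system is easy for anyone else to look at and review your code even if you don’t invite them to. It would take intentionally malicious behavior to get un-reviewed code in.”
  • There is no dedicated QA at all. Engineers are responsible for testing, bug fixes, and post-launch maintenance of their own work. Unit-testing and integration-testing frameworks are available, but only sporadically used.
  • [CORRECTION, thanks fryfrog] “I would also add that we do have QA, just not an official QA group. Every employee at an office or connected via VPN is using a version of the site that includes all the changes that are next in line to go out. This version is updated frequently and is usually 1-12 hours ahead of what the world sees. All employees are strongly encouraged to report any bugs they see and these are very quickly actioned upon.”
  • Re: surprise at the lack of QA or automated tests: “Most engineers are capable of writing bug-free code. It’s just that they don’t have an incentive to do so at most companies. When there’s a QA department, it’s easy to just throw it over to them to find the errors.” [EDIT: please note that this was a subjective opinion; I chose to include it because of the stark contrast with standard development practice at other companies.]
  • Re: surprise at the lack of PM influence and control: product managers have a lot of independence and freedom. The key to being influential is having good relationships with engineers and engineering managers. PMs need to be technical enough not to suggest stupid ideas.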
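  • By default, all code commits are packaged into a weekly release (Tuesdays).
  • With extra effort, changes can go out the same day.
  • Tuesday code releases require every engineer who committed code to that week’s release candidate to be on-site.
  • Engineers must be present in a specific IRC channel for “roll call” before the release begins, or else face a public “shaming”.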
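  • The ops team runs code releases as a gradual rollout:
    • Facebook has around 60,000 servers.
    • There are 9 concentric levels for rolling out new code.
    • [CORRECTION, thanks epriest] “The nine push phases are not concentric. There are three concentric phases (p1 = internal release, p2 = small external release, p3 = full external release). The other six phases are auxiliary tiers like our internal tools, video upload hosts, etc.”
    • The smallest level is only 6 servers.
    • For example, a new Tuesday release is first rolled out to 6 servers (level 1). The ops team observes those 6 servers and makes sure the code is behaving correctly before rolling forward to the next level.
    • If a release causes problems (e.g., throwing errors), the push is halted. The engineer who committed the offending change fixes the problem, and the release starts over at level 1.
    • So a release may go through the levels repeatedly: 1-2-3-fix, back to 1; 1-2-3-4-5-fix, back to 1; 1-2-3-4-5-6-7-8-9.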
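  • The ops team is well trained, well respected, and very business-aware. Their metrics go beyond the usual error logs, load, and memory utilization to include user behavior. For example, if a new release changes the percentage of users who engage with Facebook features, the ops team will see that in their metrics and may stop the release to investigate.
  • During a release, the ops team uses an IRC-based paging system that can reach individual engineers via Facebook, email, IRC, IM, and SMS when their attention is needed. Not responding to the ops team results in public shaming.
  • Once the code has rolled out to level 9 and is stable, the weekly push is done.
  • If a feature is not finished in time for a particular weekly push, it is not a big deal (unless there are hard external dependencies); it simply ships in a later push once it is complete.
  • Getting svn-blamed (identified by svn blame as the author of an offending change), publicly shamed, or letting projects slip too often is likely to get an engineer fired. “It’s a very high performance culture.” People who are not productive or not talented enough stand out. Managers will take poor performers aside within 6 months of hiring and say, “This just isn’t working out, you’re not a good culture fit.” This applies at every level: even C-level and VP-level hires have been dismissed when they were not productive enough.
  • [CORRECTION, thanks epriest] “People do not get called out for introducing bugs. They only get called out if they ask for changes to go out with the release but aren’t around to support them in case something goes wrong (and haven’t found someone to cover for you).”
  • [CORRECTION, thanks epriest] “Getting blamed will NOT get you fired. We are extremely forgiving in this respect, and most of the senior engineers have pushed at least one horrible thing, myself included. As far as I know, no one has ever been fired for making mistakes of this nature.”
  • [CORRECTION, thanks fryfrog] “I also don’t know of anyone who has been fired for making mistakes like are mentioned in the article. I know of people who have inadvertently taken down the site. They work hard to fix whatever caused the problem and everyone learns from it. The public shaming is far more effective than fear of being fired, in my opinion.”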
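
The notes mention settling feature debates by testing on a small sample, e.g., 1% of Nevada users. The source does not say how such a sample is chosen, so the following is only a minimal sketch of one common approach, deterministic hash bucketing. Every name here (in_test_sample, should_see_feature, the "NV" region code, the feature label) is a hypothetical chosen for illustration, not anything Facebook is known to use.

```python
import hashlib


def in_test_sample(user_id: str, feature: str, percent: float) -> bool:
    """Return True if this user falls inside the test sample for a feature.

    Hashing the feature name together with the user id gives each feature
    its own stable, roughly uniform assignment of users to buckets.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # bucket in 0..9999
    return bucket < percent * 100  # percent=1.0 keeps buckets 0..99


def should_see_feature(user_id: str, region: str, feature: str,
                       target_region: str = "NV", percent: float = 1.0) -> bool:
    """Show the feature only to the sampled share of users in one region."""
    return region == target_region and in_test_sample(user_id, feature, percent)


# Hypothetical usage: count how many of 20,000 made-up Nevada users are sampled.
sampled = sum(
    should_see_feature(f"user{i}", "NV", "new_feature") for i in range(20_000)
)
print(f"{sampled} of 20000 Nevada users would see the feature")
```

Because the assignment depends only on the user id and the feature name, a given user sees the same variant on every request, which keeps the measured behavior comparable over the test week.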

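It will be very interesting to see how Facebook’s development culture evolves over time, and especially whether that culture can keep scaling as the company grows to thousands of employees.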

What do you think? Would a “developer-driven culture” work at your company?

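Translator’s afterword: even a partial glimpse like this is often fascinating, though it deserves a closer look. Perhaps we should also pay more attention to why Facebook was able to form such a culture in the first place. What do you think?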

English original

I’m fascinated by the way Facebook operates.  It’s a very unique environment, not easily replicated (nor would their system work for all companies, even if they tried).  These are notes gathered from talking with many friends at Facebook about how the company develops and releases software.

Seems like others are also interested in Facebook…   The company’s developer-driven culture is coming under greater public scrutiny and other companies are grappling with if/how to implement developer-driven culture.   The company is pretty secretive about its internal processes, though.  Facebook’s Engineering team releases public Notes on new features and some internal systems, but these are mostly “what” kinds of articles, not “how”…  So it’s not easy for outsiders to see how Facebook is able to innovate and optimize their service so much more effectively than other companies.  In my own attempt as an outsider to understand more about how Facebook operates, I assembled these observations over a period of months.  Out of respect for the privacy of my sources, I’ve removed all names and mention of specific features/products.  And I’ve also waited for over six months to publish these notes, so they’re surely a bit out-of-date.   I hope that releasing these notes will help shed some light on how Facebook has managed to push decision-making “down” in its organization without descending into chaos…  It’s hard to argue with Facebook’s results or the coherence of Facebook’s product offerings.  I think and hope that many consumer internet companies can learn from Facebook’s example.

HUGE thanks to the many folks who helped put together this view inside of Facebook.   Thanks are also due to folks like epriest and fryfrog who have written up corrections and edits.

Notes:

  • as of June 2010, the company has nearly 2000 employees, up from roughly 1100 employees 10 months ago.  Nearly doubling staff in under a year!
  • the two largest teams are Engineering and Ops, with roughly 400-500 team members each.  Between the two they make up about 50% of the company.
  • product manager to engineer ratio is roughly 1-to-7 or 1-to-10
  • all engineers go through 4 to 6 week “Boot Camp” training where they learn the Facebook system by fixing bugs and listening to lectures given by more senior/tenured engineers.  estimate 10% of each boot camp’s trainee class don’t make it and are counseled out of the organization.
  • after boot camp, all engineers get access to live DB (comes with standard lecture about “with great power comes great responsibility” and a clear list of “fire-able offenses”, e.g., sharing private user data)
  • [EDIT thx fryfrog] “There are also very good safeguards in place to prevent anyone at the company from doing the horrible sorts of things you can imagine people have the power to do being on the inside. If you have to “become” someone who is asking for support, this is logged along with a reason and closely reviewed. Straying here is not tolerated, period.”
  • any engineer can modify any part of FB’s code base and check-in at-will
  • very engineering driven culture.  “product managers are essentially useless here.” is a quote from an engineer.  engineers can modify specs mid-process, re-order work projects, and inject new feature ideas anytime.  [EDITORIAL] The author of this blog post is a product manager, so this sentiment really caught my attention.  As you’ll see in the rest of these notes, though, it’s apparent that Facebook’s culture has really embraced product management practices so it’s not as though the role of product management is somehow ignored or omitted.  Rather, the culture of the company seems to be set so that *everyone* feels responsibility for the product.
  • during monthly cross-team meetings, the engineers are the ones who present progress reports.  product marketing and product management attend these meetings, but if they are particularly outspoken, there is actually feedback to the leadership that “product spoke too much at the last meeting.”  they really want engineers to publicly own products and be the main point of contact for the things they built.
  • resourcing for projects is purely voluntary.
    • a PM lobbies group of engineers, tries to get them excited about their ideas.
    • Engineers decide which ones sound interesting to work on.
    • Engineer talks to their manager, says “I’d like to work on these 5 things this week.”
    • Engineering Manager mostly leaves engineers’ preferences alone, may sometimes ask that certain tasks get done first.
    • Engineers handle the entire feature themselves: front end javascript, backend database code, and everything in between.  If they want help from a Designer (there is a limited staff of dedicated designers available), they need to get a Designer interested enough in their project to take it on.  Same for Architect help.  But in general, the expectation is that engineers will handle everything they need themselves.
  • arguments about whether a feature idea is worth doing generally get resolved by just spending a week implementing it and then testing it on a sample of users, e.g., 1% of Nevada users.
  • engineers generally want to work on infrastructure, scalability and “hard problems” — that’s where all the prestige is.  can be hard to get engineers excited about working on front-end projects and user interfaces.  this is the opposite of what you find in some consumer businesses where everyone wants to work on stuff that customers touch so you can point to a particular user experience and say “I built that.”  At facebook, the back-end stuff like news feed algorithms, ad-targeting algorithms, memcache optimizations, etc. are the juicy projects that engineers want.
  • commits that affect certain high-priority features (e.g., news feed) get code reviewed before merge. News Feed is important enough that Zuckerberg reviews any changes to it, but that’s an exceptional case.
  • [CORRECTION -- thx epriest] “There is mandatory code review for all changes (i.e., by one or more engineers). I think the article is just saying that Zuck doesn’t look at every change personally.”
  • [CORRECTION thx fryfrog] “All changes are reviewed by at least one person, and the system is easy for anyone else to look at and review your code even if you don’t invite them to. It would take intentionally malicious behavior to get un-reviewed code in.”
  • no QA at all, zero.  engineers responsible for testing, bug fixes, and post-launch maintenance of their own work.  there are some unit-testing and integration-testing frameworks available, but only sporadically used.
  • [CORRECTION thx fryfrog] “I would also add that we do have QA, just not an official QA group. Every employee at an office or connected via VPN is using a version of the site that includes all the changes that are next in line to go out. This version is updated frequently and is usually 1-12 hours ahead of what the world sees. All employees are strongly encouraged to report any bugs they see and these are very quickly actioned upon.”
  • re: surprise at lack of QA or automated unit tests — “most engineers are capable of writing bug-free code.  it’s just that they don’t have an incentive to do so at most companies.  when there’s a QA department, it’s easy to just throw it over to them to find the errors.”  [EDIT: please note that this was subjective opinion, I chose to include it in this post because of the stark contrast that this draws with standard development practice at other companies]
  • [CORRECTION thx epriest] “We have automated testing, including “push-blocking” tests which must pass before the release goes out. We absolutely do not believe “most engineers are capable of writing bug-free code”, much less that this is a reasonable notion to base a business upon.”
  • re: surprise at lack of PM influence/control — product managers have a lot of independence and freedom.  The key to being influential is to have really good relationships with engineering managers.  Need to be technical enough not to suggest stupid ideas.  Aside from that, there’s no need to ask for any permission or pass any reviews when establishing roadmaps/backlogs.  There are relatively few PMs, but they all feel like they have responsibility for a really important and personally-interesting area of the company.
  • by default all code commits get packaged into weekly releases (tuesdays)
  • with extra effort, changes can go out same day
  • tuesday code releases require all engineers who committed code in that week’s release candidate to be on-site
  • engineers must be present in a specific IRC channel for “roll call” before the release begins or else suffer a public “shaming”
  • ops team runs code releases by gradually rolling code out
    • facebook has around 60,000 servers
    • there are 9 concentric levels for rolling out new code
    • [CORRECTION thx epriest] “The nine push phases are not concentric. There are three concentric phases (p1 = internal release, p2 = small external release, p3 = full external release). The other six phases are auxiliary tiers like our internal tools, video upload hosts, etc.”
    • the smallest level is only 6 servers
    • e.g., new tuesday release is rolled out to 6 servers (level 1), ops team then observes those 6 servers and makes sure that they are behaving correctly before rolling forward to the next level.
    • if a release is causing any issues (e.g., throwing errors, etc.) then push is halted.  the engineer who committed the offending changeset is paged to fix the problem.  and then the release starts over again at level 1.
    • so a release may go thru levels repeatedly:  1-2-3-fix. back to 1. 1-2-3-4-5-fix.  back to 1.  1-2-3-4-5-6-7-8-9.
  • ops team is really well-trained, well-respected, and very business-aware.  their server metrics go beyond the usual error logs, load & memory utilization stats — also include user behavior.  E.g., if a new release changes the percentage of users who engage with Facebook features, the ops team will see that in their metrics and may stop a release for that reason so they can investigate.
  • during the release process, ops team uses an IRC-based paging system that can ping individual engineers via Facebook, email, IRC, IM, and SMS if needed to get their attention.  not responding to ops team results in public shaming.
  • once code has rolled out to level 9 and is stable, then done with weekly push.
  • if a feature doesn’t get coded in time for a particular weekly push, it’s not that big a deal (unless there are hard external dependencies) — features will just generally get shipped whenever they’re completed.
  • getting svn-blamed, publicly shamed, or slipping projects too often will result in an engineer getting fired.  “it’s a very high performance culture”.  people that aren’t productive or aren’t super talented really stick out.  Managers will literally take poor performers aside within 6 months of hiring and say “this just isn’t working out, you’re not a good culture fit”.  this actually applies at every level of the company, even C-level and VP-level hires have been quickly dismissed if they aren’t super productive.
  • [CORRECTION, thx epriest]  “People do not get called out for introducing bugs. They only get called out if they ask for changes to go out with the release but aren’t around to support them in case something goes wrong (and haven’t found someone to cover for you).”
  • [CORRECTION, thx epriest] “Getting blamed will NOT get you fired. We are extremely forgiving in this respect, and most of the senior engineers have pushed at least one horrible thing, myself included. As far as I know, no one has ever been fired for making mistakes of this nature.”
  • [CORRECTION, thx fryfrog] “I also don’t know of anyone who has been fired for making mistakes like are mentioned in the article. I know of people who have inadvertently taken down the site. They work hard to fix whatever caused the problem and everyone learns from it. The public shaming is far more effective than fear of being fired, in my opinion.”
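
The release loop described in the notes (roll forward level by level, halt on a problem, fix, start over at level 1) can be sketched roughly as follows. This is an illustration under stated assumptions, not Facebook’s tooling: run_push, is_healthy, and fix are hypothetical names, the health check stands in for whatever the ops team actually watches, and the levels are treated as a simple ordered list as in the original note, even though epriest’s correction says only three of the nine phases are concentric.

```python
from typing import Callable, List


def run_push(levels: List[int],
             is_healthy: Callable[[int, int], bool],
             fix: Callable[[int], None],
             max_attempts: int = 10) -> List[str]:
    """Roll a release through the levels in order and return an event log.

    If a level looks unhealthy, the push halts, the fix callback runs,
    and the next attempt starts again from the first level.
    """
    log: List[str] = []
    for attempt in range(1, max_attempts + 1):
        for level in levels:
            log.append(str(level))
            if not is_healthy(level, attempt):
                log.append("fix")
                fix(level)
                break  # halt this attempt; restart from the first level
        else:
            # The inner loop finished without a break: every level is stable.
            log.append("done")
            return log
    raise RuntimeError("push abandoned after too many attempts")


# Hypothetical scenario matching the example in the notes:
# attempt 1 fails at level 3, attempt 2 fails at level 5, attempt 3 succeeds.
failures = {(1, 3), (2, 5)}  # (attempt, level) pairs, made up for illustration
fixed_levels: List[int] = []

log = run_push(
    levels=list(range(1, 10)),
    is_healthy=lambda level, attempt: (attempt, level) not in failures,
    fix=fixed_levels.append,
)
print("-".join(log))
```

Restarting from level 1 after every fix means the fix itself is exposed to the smallest group of servers first, which is the point of a staged rollout.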

It’ll be super interesting to see how Facebook’s development culture evolves over time — and especially to see if the culture can continue scaling as the company grows into the thousands-of-employees.

What do you think?  Would “developer-driven culture” work at your company?


 

  Author: Fenng  URL: http://www.dbanotes.net/arch/facebook_how_facebook_ships_code.html


February 11, 2011
When reposting, please credit: 三月鸟社. How Facebook Ships Code (translation)