TI Corporativa

Building IT knowledge

This document details the events that followed the Google App Engine outage of February 24, 2010, the root cause of the problem, and the actions being taken to mitigate the impact of future outages. It is a good example of transparency and continuous improvement.

What did we do wrong?

Though the team had planned for this sort of failure, our response had
a few important issues:

- Although we had procedures ready for this sort of outage, the oncall
staff was unfamiliar with them and had not trained sufficiently with
the specific recovery procedure for this type of failure.

- Recent work to migrate the datastore for better multihoming changed
and improved the procedure for handling these failures significantly.
However, some documentation detailing the procedure to support the
datastore during failover incorrectly referred to the old
configuration. This led to confusion during the event.

- The production team had not agreed on a policy that clearly
indicates when, and in what situations, our oncall staff should take
aggressive user-facing actions, such as an unscheduled failover. This
led to a bad call of returning to a partially working datacenter.

- We failed to plan for the case of a power outage that might affect
some, but not all, of our machines in a datacenter (in this case,
about 25%). In particular, this led to incorrect analysis of the
serving state of the failed datacenter and when it might recover.

- Though we were able to eventually migrate traffic to the backup
datacenter, a small number of Datastore entity groups, belonging to
approximately 25 applications in total, became stuck in an
inconsistent state as a result of the failover procedure. This
represented considerably less than 0.00002% of data stored in the
Datastore.

Ultimately, although significant work had been done over the past year
to improve our handling of these types of outages, issues with
procedures reduced the impact of that work.

What are we doing to fix it?

As a result, we have instituted the following procedures going
forward:

- Introduce regular drills by all oncall staff of all of our
production procedures. This will include the rare and complicated
procedures, and all members of the team will be required to complete
the drills before joining the oncall rotation.

- Implement a regular bi-monthly audit of our operations docs to
ensure that all needed procedures are properly findable, and all out-
of-date docs are properly marked "Deprecated."

- Establish a clear policy framework to assist oncall staff to quickly
and decisively make decisions about taking intrusive, user-facing
actions during failures. This will allow them to act confidently and
without delay in emergency situations.
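
As an illustration only, a policy of this kind could be written down as
a small decision rubric that oncall staff (or tooling) can apply without
debate. The sketch below is hypothetical: the inputs, the thresholds,
and the names DatacenterState, Action, and decide are assumptions made
for this example, not App Engine's actual policy.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    WAIT = "wait and keep monitoring"
    FAIL_OVER = "start unplanned failover"


@dataclass
class DatacenterState:
    fraction_machines_down: float  # 0.0 to 1.0
    storage_usable: bool           # can the storage layer serve traffic?
    minutes_since_outage: int


# Hypothetical thresholds, chosen only for illustration.
MAX_FRACTION_DOWN = 0.10
MAX_WAIT_MINUTES = 15


def decide(state: DatacenterState) -> Action:
    """Return the action this written-down rubric prescribes."""
    if not state.storage_usable:
        return Action.FAIL_OVER
    if state.fraction_machines_down > MAX_FRACTION_DOWN:
        return Action.FAIL_OVER
    still_degraded = state.fraction_machines_down > 0
    if still_degraded and state.minutes_since_outage > MAX_WAIT_MINUTES:
        return Action.FAIL_OVER
    return Action.WAIT


# Roughly the situation at 8:22 AM in the timeline: about 25% of
# machines lost and the storage layer not usable.
print(decide(DatacenterState(0.25, False, 34)).value)
```

The value of such a rubric is less in the specific numbers than in the
fact that they are agreed on before the emergency, so the decision does
not depend on one engineer's reading of partial data.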

We believe that with these new procedures in place, last week's outage
would have been reduced in impact from about 2 hours of total
unavailability to about 10 to 20 minutes of partial unavailability.

In response to this outage, we have also decided to make a major
infrastructural change in App Engine. Currently, App Engine provides a
one-size-fits-all Datastore that offers low write latency combined
with strong consistency, in exchange for lower availability in
situations of unexpected failure in one of our serving datacenters.
Based on this outage and on feedback from our users, we have begun
work on providing two different Datastore configurations:

- The current option of low-latency, strong consistency, and lower
availability during unexpected failures (like a power outage)

- A new option for higher availability using synchronous replication
for reads and writes, at the cost of significantly higher latency

We believe that providing both of these options to you, our users,
will allow you to make your own informed decisions about the tradeoffs
you want to make in running your applications.
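
The tradeoff between the two options can be illustrated with a toy
model. This is a sketch under stated assumptions, not a description of
how the Datastore is implemented: it assumes the low-latency option
acknowledges a write as soon as the primary has it and replicates
afterwards, and that the synchronous option writes to all replicas in
parallel. The Replica class, both functions, and the latencies are
hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    write_ms: float                       # hypothetical time to persist one write
    data: dict = field(default_factory=dict)


def write_ack_early(primary, queue, key, value):
    """Acknowledge once the primary has the write; replicate later."""
    primary.data[key] = value
    queue.append((key, value))            # backups do not have it yet
    return primary.write_ms


def write_synchronous(primary, backups, key, value):
    """Acknowledge only after every replica has the write."""
    replicas = [primary, *backups]
    for replica in replicas:
        replica.data[key] = value
    # Replicas are assumed to be written in parallel, so the caller
    # waits for the slowest one.
    return max(replica.write_ms for replica in replicas)


# Hypothetical latencies: a local primary and a remote backup.
primary = Replica("primary", write_ms=5.0)
backup = Replica("backup", write_ms=50.0)
queue = []

early_ms = write_ack_early(primary, queue, "a", 1)
sync_ms = write_synchronous(primary, [backup], "b", 2)

# If the primary is lost now, only the synchronous write is on the backup.
print(early_ms, "a" in backup.data)   # 5.0 False
print(sync_ms, "b" in backup.data)    # 50.0 True
```

In this model the early acknowledgement is fast, but a sudden loss of
the primary leaves the backup without the queued write. The synchronous
write costs the latency of the slowest replica, and in exchange the
backup can take over without a recovery step.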

We sincerely apologize for the impact of Feb 24th's service disruption
on your applications. We take great pride in the reliability that App
Engine offers, but we also recognize that we can do more to improve
it. You can be confident that we will continue to work diligently to
improve the service and ensure that low-level outages like this have
the least possible effect on our customers.

Timeline
-----------

7:48 AM - Internal monitoring graphs first begin to show that traffic
has problems in our primary datacenter and is returning an elevated
number of errors. Around the same time, posts begin to show up in the
google-appengine discussion group from users who are having trouble
accessing App Engine.

7:53 AM - Google Site Reliability Engineers send an email to a broad
audience notifying oncall staff that there has been a power outage in
our primary datacenter. Google's datacenters have backup power
generators for these situations. But, in this case, around 25% of
machines in the datacenter did not receive backup power in time and
crashed. At this time, our oncall staff was paged.

8:01 AM - By this time, our primary oncall engineer has determined the
extent and the impact of the page, and has determined that App Engine
is down. The oncall engineer, according to procedure, pages our
product managers and engineering leads to handle communicating about
the outage to our users. A few minutes later, the first post from the
App Engine team about this outage is made on the external group ("We
are investigating this issue.").

8:22 AM - After further analysis, we determine that although power has
returned to the datacenter, many machines in the datacenter are
missing due to the power outage, and are not able to serve traffic.
Particularly, it is determined that the GFS and Bigtable clusters are
not in a functioning state due to having lost too many machines, and
that thus the Datastore is not usable in the primary datacenter at
that time. The oncall engineer discusses performing a failover to our
alternate datacenter with the rest of the oncall team. Agreement is
reached to pursue our unexpected failover procedure for unplanned
datacenter outages.

8:36 AM - Following up on the post on the discussion group outage
thread, the App Engine team makes a post about the outage to our
appengine-downtime-notify group and to the App Engine Status site.

8:40 AM - The primary oncall engineer discovers two conflicting sets
of procedures. This was a result of the operations process changing
after our recent migration of the Datastore. After discussion with
other oncall engineers, consensus is not reached, and members of the
engineering team attempt to contact the specific engineers responsible
for the procedure change to resolve the situation.

8:44 AM - While others attempt to determine which is the correct
unexpected failover procedure, the oncall engineer attempts to move
all traffic into a read-only state in our alternate datacenter.
Traffic is moved, but an unexpected configuration problem from this
procedure prevents the read-only traffic from working properly.

9:08 AM - Various engineers are diagnosing the problem with read-only
traffic in our alternate datacenter. In the meantime, however, the
primary oncall engineer sees data that leads them to believe that our
primary datacenter has recovered and may be able to serve. Without a
clear rubric with which to make this decision, however, the engineer
is not aware that, based on historical data, the primary datacenter is
unlikely to have recovered to a usable state by this point in time.
Traffic is moved back to the original primary datacenter as an attempt
to resume serving, while others debug the read-only issue in the
alternate datacenter.

9:18 AM - The primary oncall engineer determines that the primary
datacenter has not recovered, and cannot serve traffic. It is now
clear to oncall staff that the call was wrong, the primary will not
recover, and we must focus on the alternate datacenter. Traffic is
failed back over to the alternate datacenter, and the oncall makes the
decision to follow the unplanned failover procedure and begins the
process.

9:35 AM - An engineer with familiarity with the unplanned failover
procedure is reached, and begins providing guidance about the failover
procedure. Traffic is moved to our alternate datacenter, initially in
read-only mode.

9:48 AM - Serving for App Engine begins externally in read-only mode,
from our alternate datacenter. At this point, apps that properly
handle read-only periods should be serving correctly, though in a
reduced operational state.

9:53 AM - After engineering team consultation with the relevant
engineers, now online, the correct unplanned failover procedure
operations document is confirmed, and is ready to be used by the
oncall engineer. The actual unplanned failover procedure for reads and
writes begins.

10:09 AM - The unplanned failover procedure completes, without any
problems. Traffic resumes serving normally, read and write. App Engine
is considered up at this time.

10:19 AM - A follow-up post is made to the appengine-downtime-notify
group, letting people know that App Engine is now serving normally.
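
The durations implied by the timeline can be checked with a few lines
of arithmetic. The sketch below uses only timestamps from the entries
above; the helper minutes_between and the variable names are
illustrative.

```python
from datetime import datetime


def minutes_between(start, end):
    """Whole minutes between two same-day clock times like '7:48 AM'."""
    fmt = "%I:%M %p"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


first_errors = "7:48 AM"        # monitoring first shows elevated errors
moved_back = "9:08 AM"          # traffic returned to the primary datacenter
call_reversed = "9:18 AM"       # primary confirmed not recovered
read_only_serving = "9:48 AM"   # serving resumes in read-only mode
fully_up = "10:09 AM"           # read and write serving restored

unavailable = minutes_between(first_errors, read_only_serving)
misjudged = minutes_between(moved_back, call_reversed)
read_only = minutes_between(read_only_serving, fully_up)
total = minutes_between(first_errors, fully_up)

print(unavailable, misjudged, read_only, total)  # 120 10 21 141
```

The 120 minutes before read-only serving resumed match the "about 2
hours of total unavailability" cited earlier, followed by 21 minutes of
read-only operation.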


https://groups.google.com/group/google-appengine/browse_thread/thre...
