CST438: Software Engineering – Week 6
We are approaching the final weeks of the Software
Engineering course. The team is in its third iteration of system testing on the
project and is making good progress. We look forward to finalizing the portal
and publishing it to Amazon AWS in the coming week. This week focused on
hands-on coding with the Selenium libraries to automate the testing and
verification of data entries in our basic learning management system (LMS).
Working through the software engineering process and interacting with teammates
on this project has been extremely helpful for gaining team-coding experience.
As we progress, regular communication among team members is essential to the
project's success. Using the GitHub organization model let the team share the
project's progress as tasks advanced through each stage.
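To give a flavor of what those tests look like, here is a minimal Selenium sketch in Java; the URL, the CSS selector, and the expected value are placeholders chosen for illustration, not our project's actual code.

```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;
import java.time.Duration;

public class AssignmentEntryTest {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        try {
            // Open the locally running LMS portal (placeholder URL).
            driver.get("http://localhost:3000/assignments");

            // Wait until the assignments table renders, then read its first cell.
            WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(10));
            String title = wait.until(ExpectedConditions
                    .visibilityOfElementLocated(By.cssSelector("table tbody tr td")))
                    .getText();

            // Verify the seeded data entry appears as expected
            // ("db design" is a made-up expected value).
            if (!"db design".equals(title)) {
                throw new AssertionError("unexpected first assignment: " + title);
            }
            System.out.println("data entry verified: " + title);
        } finally {
            driver.quit();
        }
    }
}
```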
The reading for this week focuses on compute as a service
(CaaS) and how Google scaled its infrastructure management by embracing
automation and treating containers as resources. Google runs two types of jobs
in its environment: batch jobs and serving jobs. Batch jobs are short-lived,
easy to start, and focused primarily on processing throughput. Serving jobs are
mostly stateful, longer-running jobs, such as load balancers and web services;
they focus on handling peak load, redundancy, and low latency. One of the
challenges Google faced with containers is the locality of critical state:
losing a container means losing its local state, such as a database or on-disk
storage. The solution is to move all critical state to external storage while
keeping non-critical local state cheap to recreate. In other words, containers
become recyclable, holding only non-critical state. What about batch jobs that
depend on local state? Google checkpoints batch jobs more frequently to limit
how much work a lost container can waste. Caches, likewise, can simply be
rebuilt, so losing them is not a problem.
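A minimal sketch of that checkpointing pattern, assuming a local file as a stand-in for real external storage and made-up batch sizes:

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: a batch job that checkpoints its progress outside the container,
// so a rescheduled container can resume instead of restarting from scratch.
public class CheckpointedBatchJob {
    // Stand-in for external storage (in practice: a blob store or database).
    private static final Path CHECKPOINT = Path.of("checkpoint.txt");
    private static final int TOTAL_ITEMS = 10_000;
    private static final int CHECKPOINT_EVERY = 1_000;

    public static void main(String[] args) throws Exception {
        int start = loadCheckpoint(); // resume from the last saved position
        for (int i = start; i < TOTAL_ITEMS; i++) {
            process(i);
            // More frequent checkpoints bound how much work a lost container wastes.
            if ((i + 1) % CHECKPOINT_EVERY == 0) {
                Files.writeString(CHECKPOINT, Integer.toString(i + 1));
            }
        }
    }

    private static int loadCheckpoint() throws Exception {
        return Files.exists(CHECKPOINT)
                ? Integer.parseInt(Files.readString(CHECKPOINT).trim())
                : 0;
    }

    private static void process(int item) {
        // placeholder for real work on one input item
    }
}
```

The design choice mirrors the chapter's point: the container stays disposable because everything needed to resume lives outside it.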
Abstraction in design is necessary not only for scalability;
it also provides a layer of protection against indirect internal changes. For
instance, Hyrum's Law shows up in a leaky abstraction around process IDs:
consumers quietly depended on the PID range, so raising the PID maximum broke
logging systems and caused naming collisions. Over time, we will need a CaaS
that provides dynamic abstraction for specific behaviors and withstands rapid
change; containers could help us with that.
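To illustrate the kind of hidden dependency Hyrum's Law describes, here is a hypothetical naming scheme (my own example, not the book's code) that silently assumes PIDs stay within the historical five-digit range:

```java
// Sketch of the leaky abstraction: code that silently assumes PIDs
// never exceed five digits (the historical Linux pid_max of 32768).
public class PidLogNaming {
    // Hypothetical naming scheme: fixed-width PID embedded in the file name.
    static String logFileFor(long pid) {
        return String.format("app-%05d.log", pid); // widens past 99999
    }

    public static void main(String[] args) {
        System.out.println(logFileFor(4821));    // app-04821.log
        // After pid_max is raised (e.g., to 4194304), names grow wider,
        // breaking fixed-width parsers and any rotation or sorting logic
        // that assumed exactly five digits.
        System.out.println(logFileFor(1048576)); // app-1048576.log
    }
}
```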
Google's Borg changed the handling of batch jobs and serving
jobs on isolated machines, combining everything into a shared pool. As a
result, the unified compute system consolidated tools and administration,
lowering the CapEx and OpEx of operations. Borg also managed workloads
dynamically: serving jobs overprovision for peak load, Borg lends their idle
capacity to batch jobs, and it reclaims those resources from batch work
whenever the serving jobs need them back. Although the unified Borg system
maintained a robust balance of resource usage and efficiency, it came at the
cost of higher complexity. Moreover, establishing Borg as a robust container
infrastructure took a significant investment and much of Google's engineering
workforce, an option that is not readily available to most companies.
Alternatively, the public cloud offers easy scaling and offloads infrastructure
management. Hybrid cloud is another flexible option: it combines on-premises
infrastructure with the public cloud, extending into cloud resources when local
capacity runs out.
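Circling back to Borg's reclamation idea, here is a toy model of a shared pool, entirely my own sketch rather than Borg's actual scheduler: serving jobs reserve guaranteed capacity, and batch tasks run only in the leftover slack.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy model of a shared pool: serving jobs reserve capacity up front,
// and batch work is admitted only into the remaining slack.
public class SharedPool {
    private final int totalCpus;
    private int reservedByServing;
    private final Queue<Runnable> batchQueue = new ArrayDeque<>();

    SharedPool(int totalCpus) { this.totalCpus = totalCpus; }

    // Serving jobs get guaranteed capacity (redundancy and latency matter).
    boolean reserveForServing(int cpus) {
        if (reservedByServing + cpus > totalCpus) return false;
        reservedByServing += cpus;
        return true;
    }

    void submitBatch(Runnable job) { batchQueue.add(job); }

    // Batch jobs run opportunistically on whatever capacity is left over.
    int runBatch() {
        int slack = totalCpus - reservedByServing;
        int started = 0;
        while (started < slack && !batchQueue.isEmpty()) {
            batchQueue.poll().run();
            started++;
        }
        return started;
    }

    public static void main(String[] args) {
        SharedPool pool = new SharedPool(16);
        pool.reserveForServing(12);            // e.g., a web service's reservation
        for (int i = 0; i < 6; i++) {
            int n = i;
            pool.submitBatch(() -> System.out.println("batch task " + n));
        }
        System.out.println("ran " + pool.runBatch() + " batch tasks in the slack");
    }
}
```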