The Systems group at Cornell examines the design and implementation of the fundamental software systems that form our computing infrastructure. Below we give a small sample of the varied systems work going on here, and we invite you to visit the project and faculty web pages, as well as read our papers.

Cloud Computing


The increasing importance of cloud infrastructures, cloud computing, and containerization is leading organizations to move growing portions of their enterprise activities out of their own data centers and into the cloud. Hakim Weatherspoon and Robbert van Renesse have been collaborating on improvements to the cloud paradigm.  A key enabler here is nested virtualization: the ability to run a virtual machine inside another virtual machine.  Nested virtualization gives the user significant new capabilities.  For example, it allows a user to migrate their virtual machines between cloud providers, or to dynamically consolidate multiple lightly loaded virtual machines into a single virtual machine.  Also, since virtual machines often run a single application, access to the virtual machine monitor makes it possible to remove the kernel/user-space barrier, leading to significant improvements in performance.
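To make the consolidation idea concrete, here is a minimal, hypothetical sketch of how a controller might pack lightly loaded virtual machines onto fewer hosts. The VM names, the "lightly loaded" threshold, and the first-fit packing policy are illustrative assumptions, not the Weatherspoon/van Renesse system.

```python
# Hypothetical sketch: consolidating lightly loaded VMs onto fewer hosts.
# The VM data, the 0.25 "lightly loaded" threshold, and the first-fit
# packing policy are illustrative assumptions, not Cornell's implementation.
from dataclasses import dataclass

@dataclass
class VM:
    name: str
    cpu: float   # fraction of one host's CPU currently used
    mem: float   # fraction of one host's memory currently used

def consolidate(vms, light=0.25, capacity=0.9):
    """Pack lightly loaded VMs together; return a list of host assignments."""
    light_vms = sorted((v for v in vms if max(v.cpu, v.mem) <= light),
                       key=lambda v: max(v.cpu, v.mem), reverse=True)
    hosts = []  # each host tracks remaining cpu/mem and its assigned VMs
    for vm in light_vms:
        for h in hosts:  # first fit onto an existing host
            if h["cpu"] >= vm.cpu and h["mem"] >= vm.mem:
                h["cpu"] -= vm.cpu; h["mem"] -= vm.mem; h["vms"].append(vm)
                break
        else:            # otherwise open a new host
            hosts.append({"cpu": capacity - vm.cpu,
                          "mem": capacity - vm.mem, "vms": [vm]})
    return hosts

if __name__ == "__main__":
    fleet = [VM("web-1", 0.10, 0.15), VM("web-2", 0.08, 0.12),
             VM("db-1", 0.70, 0.60), VM("cache-1", 0.05, 0.20)]
    for i, h in enumerate(consolidate(fleet)):
        print(f"host {i}: {[vm.name for vm in h['vms']]}")
```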

Datacenter applications consist of many communicating components and evolve organically as requirements develop over time.  Lorenzo Alvisi and Robbert van Renesse are working on two projects that support such organic growth.  The first project, Escher, recognizes that components of a distributed system may themselves be distributed systems.  Escher introduces a communication abstraction that hides the internals of a distributed component, and in particular how to communicate with it, from other components.  Using Escher, a replicated server can invoke another replicated server without either server having to know that the other is replicated.  The second project, Ziplog, is a datacenter-scale totally ordered logging service.  Logs are increasingly a central component of many datacenter applications, but existing totally ordered log implementations trade off scale against latency, and log reconfigurations can cause significant hiccups in application performance.  Ziplog provides both large scale and ultra-low latency, and its seamless reconfiguration operations allow it to scale up and down without any downtime.
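The core Escher idea described above, hiding whether a component is replicated behind a uniform invocation interface, can be sketched roughly as follows. The class and method names here are hypothetical; this is not Escher's actual API.

```python
# Rough sketch of replication-transparent invocation (hypothetical names,
# not Escher's actual API): callers see one invoke() interface whether the
# component is a single server or a replicated group.
from collections import Counter

class Server:
    """A single-server component."""
    def __init__(self, name):
        self.name = name
        self.store = {}

    def invoke(self, request):
        op, key, *val = request
        if op == "put":
            self.store[key] = val[0]
            return "ok"
        return self.store.get(key)

class ReplicatedServer:
    """A replicated component presenting the same invoke() interface.

    Internally it forwards each request to every replica and returns the
    majority answer; callers cannot tell it apart from a single Server.
    """
    def __init__(self, replicas):
        self.replicas = replicas

    def invoke(self, request):
        answers = Counter(r.invoke(request) for r in self.replicas)
        return answers.most_common(1)[0][0]

if __name__ == "__main__":
    backend = ReplicatedServer([Server("r1"), Server("r2"), Server("r3")])
    backend.invoke(("put", "x", 42))
    print(backend.invoke(("get", "x")))  # -> 42; replication is invisible
```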

Ken Birman's group spent the last few years developing a new library, Derecho, for building highly assured cloud computing systems that replicate state and coordinate actions.  Derecho achieves blinding speed by leveraging RDMA hardware; in fact, it is faster than standard systems even over ordinary TCP.  The core of Derecho gains speed not just by mapping efficiently to fast networking, although that is certainly one aspect of the advantage: the other is that Derecho as a whole uses versions of state machine replication (Paxos) that have been "refactored" with ultra-high bandwidths (and fairly low latencies) in mind.
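State machine replication, the technique Derecho refactors, can be summarized in a few lines: if deterministic replicas apply the same commands in the same order, they end up in the same state. The sketch below illustrates only that principle; it is not Derecho's implementation and it omits the agreement protocol that actually produces the order.

```python
# Minimal illustration of state machine replication: deterministic replicas
# applying one agreed-upon command order stay identical. The agreement
# (Paxos) step is omitted entirely; this is not Derecho's implementation.
class CounterReplica:
    def __init__(self):
        self.value = 0

    def apply(self, command):
        op, amount = command
        if op == "add":
            self.value += amount
        elif op == "mul":
            self.value *= amount

replicas = [CounterReplica() for _ in range(3)]
agreed_order = [("add", 5), ("mul", 3), ("add", 2)]  # produced by consensus

for cmd in agreed_order:          # every replica applies the same sequence
    for r in replicas:
        r.apply(cmd)

assert len({r.value for r in replicas}) == 1  # all replicas agree: 17
print([r.value for r in replicas])
```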

In newer work, Ken's group is using Derecho to create Cascade: a novel key-value store (KVS) for low-latency, high-bandwidth machine learning and artificial intelligence.  Cascade hosts AI and ML logic directly in its address space, enabling zero-copy, lock-free data access, so that the instant data arrives it can trigger the desired ML or AI computation.  Data is cached in host memory or on the GPU for additional speed, and unlike a standard KVS, Cascade is able to plan both data placement and task placement to optimize for high resource utilization with ultra-low latency.  Early uses of Cascade include an industrial IoT (IIoT) platform that leverages the Siemens Mendix product as a dashboard but integrates tightly with Cascade's fast data paths; digital agriculture work aimed at smart farms and larger-scale regional planning; onboard flight applications for fly-by-wire aircraft; and smart power grid control.   Down the road, our thinking is to expand into 5G mobility by extending Cascade into a new kind of "edge cloud" for IoT intelligence.
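The central Cascade idea described above, computation co-located with storage so that an arriving object immediately triggers the logic registered for it, can be sketched as follows. The trigger API shown here is hypothetical, not Cascade's actual interface.

```python
# Hypothetical sketch of a KVS that runs registered ML/AI handlers in the
# same address space the moment a matching object arrives. The API names
# are illustrative; this is not Cascade's actual interface.
from collections import defaultdict

class TriggeredKVS:
    def __init__(self):
        self.store = {}
        self.triggers = defaultdict(list)   # key prefix -> handlers

    def register(self, prefix, handler):
        """Attach a compute handler to every key under a prefix."""
        self.triggers[prefix].append(handler)

    def put(self, key, value):
        self.store[key] = value             # handlers see the stored object
        for prefix, handlers in self.triggers.items():
            if key.startswith(prefix):
                for handler in handlers:    # runs in the store's address space
                    handler(key, value)

    def get(self, key):
        return self.store.get(key)

def classify_image(key, image_bytes):
    # Placeholder for an ML model invocation triggered by data arrival.
    print(f"classifying {key} ({len(image_bytes)} bytes)")

kvs = TriggeredKVS()
kvs.register("camera/", classify_image)
kvs.put("camera/frame-001", b"\x00" * 1024)   # arrival triggers the handler
```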

For summaries of additional cloud computing research at Cornell see cloudcomputing.cornell.edu.

Distributed Systems and Fault Tolerance


Cornell is particularly well-known for its foundational and practical work on fault-tolerant distributed systems. Ken Birman's book on reliable distributed systems is widely used in classrooms and in industry. His Isis Toolkit was used extensively in industry for decades to build fault-tolerant systems. Fred Schneider's oft-cited, ACM Hall of Fame award-winning tutorial on State Machine Replication is standard fare in systems courses around the world. Van Renesse and Schneider invented and analyzed the Chain Replication paradigm, which is now used by several large Internet services, including Microsoft Azure storage.
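For readers unfamiliar with it, chain replication arranges replicas in a chain: writes enter at the head, are passed from replica to replica, and are acknowledged only once they reach the tail, which also serves reads. The sketch below illustrates that published protocol in its simplest form; it is not any particular Cornell implementation and omits failure handling and reconfiguration.

```python
# Simplest-case illustration of chain replication: writes flow head -> tail,
# reads are answered at the tail. Failure handling and reconfiguration,
# the hard parts, are omitted; this is not a production implementation.
class ChainNode:
    def __init__(self, name):
        self.name = name
        self.store = {}
        self.successor = None

    def write(self, key, value):
        self.store[key] = value
        if self.successor is not None:        # forward down the chain
            return self.successor.write(key, value)
        return "ack"                          # tail acknowledges the write

    def read(self, key):
        return self.store.get(key)            # reads are served by the tail

def make_chain(names):
    nodes = [ChainNode(n) for n in names]
    for a, b in zip(nodes, nodes[1:]):
        a.successor = b
    return nodes[0], nodes[-1]                # head, tail

head, tail = make_chain(["A", "B", "C"])
assert head.write("x", 1) == "ack"            # committed once the tail has it
print(tail.read("x"))                          # -> 1
```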

Fred Schneider and Robbert van Renesse are currently collaborating on a new version of Chain Replication that is able to self-configure.  If successful, it will be the first replication protocol for the fail-stop model that is able to self-configure. Greg Morrisett and Robbert van Renesse are collaborating on Byzantine replication protocols and blockchain protocols that are provably correct.

Networking


Nate Foster, in cooperation with researchers at Princeton, has been developing high-level languages, such as Frenetic, for programming distributed collections of enterprise network switches (ICFP 2011, HotNets 2011, POPL 2012). These languages allow modular reasoning about network properties. Ken Birman, Robbert van Renesse, Hakim Weatherspoon, and their students are working with researchers from academia and industry on the Nebula project, whose goal is to address threats to the cloud while meeting the challenges of flexibility, extensibility, and economic viability (IEEE Internet Computing 2011). One artifact that came out of this work is TCPR, a tool that fault-tolerant applications can use to recover their TCP connections after crashing or migrating; it masks the application failure and enables transparent recovery, so remote peers remain unaware. Another artifact under development is SoNIC (Software-defined Network Interface Card), which provides precise and reproducible measurements of an optical lambda network. By achieving extremely high levels of precision, SoNIC can shed light on the complexities of flows that traverse high-speed networks (IMC 2010, DSN 2010).  Birman and Van Renesse also collaborated with Cisco to create a high-availability option for the Cisco CRS-1 backbone routers (DSN 2013).
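To give a flavor of the modular reasoning these network programming languages aim for, here is a toy sketch of policies as small composable functions over packets. It deliberately does not reproduce Frenetic's actual syntax or semantics; the field names and combinators are illustrative assumptions.

```python
# Toy illustration of compositional network policies: small policies are
# written independently and then combined. This only conveys the flavor of
# the idea; it is not Frenetic's actual syntax or semantics.
def match(**fields):
    """Policy that passes a packet through unchanged if it matches."""
    def policy(pkt):
        return [pkt] if all(pkt.get(k) == v for k, v in fields.items()) else []
    return policy

def modify(**fields):
    """Policy that rewrites header fields."""
    return lambda pkt: [{**pkt, **fields}]

def seq(p1, p2):
    """Sequential composition: feed p1's output packets into p2."""
    return lambda pkt: [out for mid in p1(pkt) for out in p2(mid)]

def union(p1, p2):
    """Parallel composition: apply both policies and merge the results."""
    return lambda pkt: p1(pkt) + p2(pkt)

# Two independently written policies, composed without modifying either.
monitor_web = seq(match(dst_port=80), modify(out_port="collector"))
route_web   = seq(match(dst_port=80), modify(out_port=2))
policy = union(monitor_web, route_web)

pkt = {"dst_port": 80, "out_port": None}
print(policy(pkt))   # one copy to the collector, one forwarded on port 2
```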

Cross-Cutting Research Impact

Besides the topics mentioned above, the systems faculty are also actively involved in cross-cutting research involving Security, Programming Languages, Computer Architecture, and Theory. Many Cornell systems have been adopted by industry, and companies such as Microsoft, Google, Amazon, and Facebook all use Cornell software or algorithms in diverse ways.  Indeed, the CTO of Amazon, Werner Vogels, joined that company directly from PhD research at Cornell, and the Chief Scientist of Microsoft Azure Global, Ranveer Chandra, also has a Cornell PhD; Chandra also heads Microsoft's networking research and IoT Edge computing efforts.  Johannes Gehrke, previously a Cornell faculty member, is head of research for all of Microsoft Redmond.  Other Cornell students went on to take leadership roles at Facebook, revamping that company's entire content hosting and distributed systems architecture, and Ulfar Erlingsson was head of security research at Google for nearly a decade.  Yaron Minsky, a Cornell systems PhD, leads a major Wall Street computing infrastructure group at Jane Street.

Environment

The Systems Group at Cornell prides itself on its collegial internal environment. Our Systems Lunches, where professors and graduate students get together every Friday to have lunch and discuss recent, cutting-edge papers in the field, draw 40–60 people and have been adopted by many other institutions. And Cornell's Systems Lab, a large collaborative space with wall-to-wall whiteboards, projectors, sound systems, and work areas for up to three dozen people, has served as a crucible where people hack together on projects and design new systems.