Initialize Once, Start Fast: Application Initialization at Build Time
1 INTRODUCTION
The recent shift to cloud computing and microservices has changed how server applications are
written and deployed: Instead of using large application servers, independent services are deployed
individually and executed on demand. When the load on a particular service increases, the underly-
ing platform (e.g., AWS Lambda [Amazon Web Services, Inc 2019], IBM Cloud Functions [IBM 2019],
Microsoft Azure Functions [Microsoft Azure 2019], Google Cloud Functions [Google Cloud Platform
2019], or Oracle Functions [Smith 2018]) spawns new workers to process the incoming workload.
The first request processed on the newly spawned worker (cold start) requires initialization of
the language runtime and initialization of the service configuration before the actual request is
processed.
Cold start of VMs can be unpredictable [Barrett et al. 2017] and an order of magnitude slower
than subsequent executions [Akkus et al. 2018], which can break the service-level agreement of
a cloud service. Some language platforms, such as Java VMs and .NET, are particularly slow at
startup because they perform expensive code verification, class loading, bytecode interpretation,
profiling, and dynamic compilation.
In this paper, we explore a novel approach for reducing the startup time and memory footprint:
combining points-to analysis, application initialization at build time, heap snapshotting, and ahead-
of-time (AOT) compilation. This preserves the benefits of the Java ecosystem: combining existing
libraries that are only loosely coupled and extend each other using, e.g., interfaces and declarative
configuration files. Our approach is based on a closed-world assumption, i.e., all Java classes must be
known and available at build time. At build time, the Java application and all its libraries (including
the JDK) are processed by the points-to analysis to find the reachable program elements (classes,
methods, and fields), so that only the necessary methods of Java classes are compiled.
Initialization code of the application can run at build time instead of at run time. The application
can control what is initialized and allocate Java objects to build complex data structures. These
objects are available at run time using a so-called image heap, a pre-initialized part of the heap that
is part of the executable and available immediately at application startup. The points-to analysis
can make objects reachable in the image heap, and the snapshotting that builds the image heap can
make new methods reachable for the points-to analysis. These steps are executed iteratively until a
fixed point is reached.
With our approach, only a few initialization steps are executed at run time before the application’s
main method is invoked, for example, memory-mapping of the image heap. Multiple processes that
run the same executable, or multiple isolated VM instances within one process, use copy-on-write
sharing of the image heap. This reduces the overall memory footprint when multiple instances of
the same application are started.
We combine ideas from several areas of prior research: The concept of an image heap has
been used before by metacircular VMs [Alpern et al. 1999, 2005; Rogers and Grove 2009; Wimmer
et al. 2013]. Heap snapshotting has been used by Smalltalk [Ingalls 1983] and Self [Ungar 1995],
where development and deployment were not distinguishable and both code and data were kept in
snapshots. Points-to analysis [Hind 2001; Ryder 2003; Smaragdakis and Balatsouras 2015; Sridharan
et al. 2013] and AOT compilation are frequently applied and well-studied. The contribution of this
paper is the combination of these ideas: an iterative application of points-to analysis and heap
snapshotting.
We believe that our approach is more suitable for microservices than checkpoint/restore systems,
e.g., CRIU [CRIU 2019], that restore a Java VM such as the Java HotSpot VM: Restoring the Java
HotSpot VM from a checkpoint does not reduce the memory footprint that is systemic due to the
dynamic class loading and dynamic optimization approach, i.e., the memory that the Java HotSpot
VM needs for class metadata, Java bytecode, and dynamically compiled code. In addition, it cannot
rely on a points-to analysis to prune unnecessary parts of the application.
Our implementation, the Native Image component of GraalVM [Oracle 2019a], is written in Java
and Java is used for all examples in this paper, but our approach is not limited to Java or languages
that compile to Java bytecode. It can be applied to all managed languages that are amenable to
points-to analysis, such as C# or other languages of the .NET framework.
In summary, this paper contributes the following:
• We run parts of an application at build time and snapshot the objects allocated by this
initialization code, using an iterative approach that is intertwined with points-to analysis.
• We use points-to analysis results to only AOT-compile the parts of an application that are
reachable at run time.
• We allow multiple processes, and multiple isolated VM instances within one process, to share
the image heap using copy-on-write memory mapping.
• We measure the startup behavior and peak performance, and show that copy-on-write sharing
of the image heap has significant benefits.
2 SYSTEM OVERVIEW
This section presents the system structure and details for all of the build-time and run-time
components of GraalVM Native Image. The input of our system is Java bytecode, compiled from
any language that compiles to Java bytecode such as Java, Scala, or Kotlin. The application, its
libraries, the JDK, and VM components are all processed the same way. The result is a native
executable for a specific operating system and architecture. We call such an executable a native
image. Figure 1 shows the overall build-time system structure. First, points-to analysis and heap
snapshotting are executed iteratively until a fixed point is reached. Callbacks registered by the
application are also executed as part of this iterative approach, i.e., the application can participate.
The result of this stage is a list of reachable classes, methods, and fields; as well as an object graph
of reachable objects. Then, the reachable methods are compiled to machine code, and the object
graph is written out as the image heap in the same layout that is used at run time for the heap. The
machine code is stored in the text section of the native image, and the image heap is stored in the
data section.
We call the entire process that produces a native image the image build time, to clearly distinguish
it from the compilation of Java source code to bytecode (often called "compile time" or "build time").
The image builder is a Java application. The same Java VM that executes the points-to analysis and
AOT compilation also runs the class initializers and the initialization callbacks of the application.
This means that at image build time, objects that later form the image heap are ordinary objects in
the Java heap of the image builder. Heap snapshotting discovers the objects that are reachable for
the image heap.
One of the benefits of our approach is the blurred line between build time and run time: It is
easy to move initializations between build time and run time without large code changes, i.e., it is
possible to write an application that can support both configuration at build time and configuration
at run time. Section 4 presents a concrete case study of such an application.
2.1 Points-to Analysis
We use a points-to analysis [Smaragdakis and Balatsouras 2015; Sridharan et al. 2013] to determine
which classes, methods, and fields are reachable at run time. The points-to analysis starts with
all entry points, e.g., the main method of the application, and iteratively processes all transitively
reachable methods until a fixed point is reached. Our analysis is context insensitive, path insensitive,
and flow insensitive for fields, but flow sensitive for local variables because it is based on SSA
form [Hind 2001; Ryder 2003]. This subsection provides some background on our points-to analysis
implementation. Note that this paper does not contribute a novel points-to analysis approach.
The points-to analysis uses the front end of our compiler to parse Java bytecode into the compiler’s
high-level intermediate representation (IR). The IR is then converted to a so-called type-flow graph.
Nodes in the graph are instructions that operate on object types. Edges are directed use edges
between nodes, i.e., they point from the definition towards the usage. Each node maintains a type
state: a list of types that can reach this node and nullness information. Type states are propagated
through the use edges: if the type state of a node is changed, the change is propagated to all usages.
Type states can only grow, i.e., new types can be added to a type state but types are never removed.
This ensures that the analysis eventually reaches a fixed point and terminates.
The type-flow graph is a single interprocedural graph that covers the whole application. For
static method calls, the caller and callee are linked when the type-flow graph is created: A use
edge connects an actual argument node with the matching formal argument node, i.e., types flow
into methods using argument nodes. The return node of the method has a use edge to the invocation
node in the caller, i.e., types flow out of methods using return nodes. For virtual and interface
method calls, the caller and callees are linked dynamically while the analysis is running: the type
state of the receiver determines which callees are reachable. When a new type is added to the type
state of the receiver, the new type is used to resolve the invocation. This ensures that types are
correctly propagated through virtual method calls, but only necessary callees are linked. For each
type, the points-to analysis tracks if the type is instantiated. Allocation bytecodes mark a type as
instantiated, and are one of the sources of types in the type-flow graph (type-flow nodes without a
predecessor). For each field, the points-to analysis tracks if the field is read or written. Reads and
writes are tracked separately to find fields that are only read but never written at run time. Such
fields are constant-folded later during AOT compilation.
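To make the propagation concrete, the following sketch (with made-up class names, not taken from the paper) shows how type states restrict the callees of a virtual call; the comments describe the information the analysis derives.

interface Shape { double area(); }

class Circle implements Shape {
    double radius;
    public double area() { return Math.PI * radius * radius; }
}

class Square implements Shape {
    double side;
    public double area() { return side * side; }
}

public class TypeFlowExample {
    static double areaOf(Shape shape) {
        // The type state of 'shape' contains only the types that flow into this formal
        // argument node. Here that is {Circle}, so the virtual call below has a single
        // reachable callee and can later be de-virtualized and inlined.
        return shape.area();
    }

    public static void main(String[] args) {
        Shape c = new Circle();   // allocation marks Circle as instantiated (a source node)
        areaOf(c);                // Circle is propagated along the use edge into areaOf()
        // Square is never instantiated, so Square.area() never becomes a reachable callee
        // and is not compiled into the native image.
    }
}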
2.2 Run Initialization Code
When the points-to analysis reaches a local fixed point (no more types added to type states),
initialization code is executed. It comes from two different sources:
First, class initializers are executed. In Java, every class can have a class initializer (sometimes
also called "static initializer") that is represented as a method named <clinit> in the class file. It
computes the initial value of static fields. We allow the developer to decide which classes are
initialized at image build time, and which classes remain uninitialized at image build time and are
then initialized at run time. Section 3.1 provides more information on this policy choice.
Second, the application can register explicit callbacks that are invoked at build time. The applica-
tion can run custom code, e.g., before, during, and after the analysis stage, by implementing hooks
provided by our Feature interface. Figure 2 shows a fragment of this API.
interface Feature {
    // Hooks that are invoked at various stages at image build time.
    void beforeAnalysis(BeforeAnalysisAccess access);
    void duringAnalysis(DuringAnalysisAccess access);
    void afterAnalysis(AfterAnalysisAccess access);
}

interface DuringAnalysisAccess {
    // Access to the current points-to analysis state.
    boolean isReachable(Class clazz);
    boolean isReachable(Field field);
    boolean isReachable(Executable method);
}
Fig. 2. Fragment of the API to run application code at build time.
The relevant hook to run code before heap snapshotting is the "during analysis" hook: it is exe-
cuted each time the points-to analysis has reached a local fixed point, but before the corresponding
heap snapshotting. At this time, the application can run custom code that, e.g., allocates objects
and initializes larger data structures. The important point to notice is that the initialization code
can access the current points-to analysis state: it can query if a type, method, or field was already
marked as reachable by the points-to analysis using the various isReachable() methods provided
by DuringAnalysisAccess. The application can use this information to build data structures that
are optimized for exactly the reachable parts of the application. Section 2.8 and the case study in
Section 4 present examples for running initializations at build time. These examples use the API
shown in Figure 2. The complete API is documented on the GraalVM homepage [Oracle 2019a].
2.3 Heap Snapshotting
Heap snapshotting builds an object graph, i.e., the transitive closure of reachable objects starting
with root pointers such as static fields. This object graph is written into the native image as the
image heap, i.e., the initial heap when the native image is started. Figure 3 shows the core algorithm.
The input of the algorithm is the state of the points-to analysis, pointsToState.
The root pointers of the image heap are static object fields that are marked by the points-to
analysis as read, as well as values that were constant folded into methods such as our image
singletons (see Section 3.5). These root values are first added to a worklist, and then the worklist is
processed until it is empty. Every object is added to the worklist and processed only once; for this
purpose, a set of all reachable objects is maintained.
To build the transitive closure, object fields are followed by reading field values using reflection
(remember that the image builder is a Java application). Only instance fields that are marked as
read by the points-to analysis are considered, i.e., if a class has two instance fields but one of them
is not marked as read by the points-to analysis, the object reachable from that field is not part of
the image heap.
If the class of the field’s value was not yet seen as a possible type of the field by the points-to
analysis, the class is registered as a field type. In such a case, the next run of the points-to analysis
propagates this new type to all reads of the field, and to all transitive usages in the type-flow graph.
Object arrays are handled in a similar way. In contrast to fields, the points-to analysis only
maintains a single set of types for object arrays (the Java type system does not properly distinguish
between arrays of different types). Types added to this set are propagated by the points-to analysis
through the type-flow graph.
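As a rough illustration of the worklist traversal described above, the following sketch walks the object graph with reflection and reports newly seen field types back to the analysis. The PointsToState interface and all names are hypothetical stand-ins, not the actual implementation from Figure 3, and error handling is simplified.

import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

interface PointsToState {
    List<Object> rootValues();                          // static fields marked as read, embedded constants
    boolean isRead(Field field);                        // instance field marked as read by the analysis
    void registerFieldType(Field field, Class<?> type); // feeds a newly seen field type back to the analysis
}

final class HeapSnapshotting {
    static Set<Object> snapshot(PointsToState pointsToState) throws IllegalAccessException {
        Set<Object> reachable = Collections.newSetFromMap(new IdentityHashMap<>());
        Deque<Object> worklist = new ArrayDeque<>(pointsToState.rootValues());
        while (!worklist.isEmpty()) {
            Object obj = worklist.pop();
            if (!reachable.add(obj)) {
                continue; // every object is processed only once
            }
            if (obj instanceof Object[]) {
                for (Object element : (Object[]) obj) {  // object arrays: follow all elements
                    if (element != null) {
                        worklist.push(element);
                    }
                }
            } else {
                for (Class<?> c = obj.getClass(); c != null; c = c.getSuperclass()) {
                    for (Field field : c.getDeclaredFields()) {
                        if (Modifier.isStatic(field.getModifiers()) || field.getType().isPrimitive()) {
                            continue;
                        }
                        if (!pointsToState.isRead(field)) {
                            continue; // fields not marked as read are not followed
                        }
                        field.setAccessible(true);
                        Object value = field.get(obj);   // the image builder is a Java application
                        if (value != null) {
                            // May add a type the analysis has not yet seen for this field,
                            // which triggers another round of the points-to analysis.
                            pointsToState.registerFieldType(field, value.getClass());
                            worklist.push(value);
                        }
                    }
                }
            }
        }
        return reachable;
    }
}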
2.4 Ahead-of-Time (AOT) Compilation
Only methods that are marked as reachable by the points-to analysis are AOT compiled to machine
code and placed in the text section of the executable. We use the GraalVM compiler, an established
compiler that compiles Java bytecode to machine code. The compiler is modular and can be
configured for different VMs and compilation scenarios such as dynamic compilation and AOT
compilation. However, the approach proposed in this paper is mostly independent from the compiler,
and adapting our compiler to use points-to analysis results for AOT compilation instead of profiling
information for dynamic compilation was not a significant amount of work.
The GraalVM compiler performs all standard optimizations such as method inlining, constant
folding, arithmetic optimizations, and loop optimizations, as well as optimizations that are especially
effective for object-oriented Java code such as partial escape analysis [Stadler et al. 2014]. Speculative
compiler optimizations that require deoptimization are disabled for the AOT compilation. Instead,
the compiler incorporates points-to analysis results to improve the code quality: Fields that are
not marked as written at run time are constant folded, regardless of whether they are declared as
final or not. Virtual and interface calls for which the points-to analysis found only a single callee
are de-virtualized to a direct call, which also enables inlining of the callee. Type checks where the
type state of the checked value only contains types that extend the checked type (or only types
that do not extend the checked type) are replaced with a constant result. Similarly, null checks are
removed when the points-to analysis marks a type state as never null.
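As an illustration with made-up names (not from the paper), consider a static field that is set by a class initializer at image build time and never written at run time:

class Flags {
    // Set once by the class initializer (executed at image build time, see Section 3.1);
    // the points-to analysis finds no write to this field in any code reachable at run time.
    static boolean assertionsEnabled = computeAssertionStatus();

    static boolean computeAssertionStatus() {
        return Flags.class.desiredAssertionStatus();
    }
}

class RequestHandler {
    void handle(String request) {
        // 'assertionsEnabled' is not declared final, but because it is never written at
        // run time its build-time value is constant-folded and the dead branch is removed.
        if (Flags.assertionsEnabled) {
            System.out.println("handling " + request);
        }
    }
}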
Code that runs at image build time is often not reachable at run time and therefore not compiled.
For example, class initializers that are executed at image build time are not compiled (see Section 3.1).
However, it is allowed to have code that runs both at image build time and at run time, e.g., shared
utility methods.
2.5 Image Heap
Execution at run time starts with an already pre-populated Java heap that is constructed by the
heap snapshotting algorithm during image building: the image heap. Starting up the executable, or
creating a new isolated VM instance (called isolate; see Section 2.6) in an existing process, must be
fast and have a low memory overhead, so that many VM instances can be created for short-running
tasks. Therefore, there must not be an expensive relocation step when loading the image heap.
For isolates, it must be possible to map the image heap at different memory addresses in the same
address space.
To achieve these goals, we use references that are relative to the start of the image heap. This
means that memory accesses, like loading a field from an object or an array element, need more
address arithmetic than the usual absolute references: the heap start must be added to the reference
before the memory access. Objects of the image heap and objects allocated at run time use the same
format, i.e., also objects allocated at run time use relative references. To make this as fast as possible,
the heap start is always available in a fixed register (we use the register r14 on x64 architectures).
Note that in many cases the addition can be folded into the memory access instruction on the x64
architecture, avoiding an explicit arithmetic operation.
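The address arithmetic can be illustrated as follows; this is a conceptual sketch with invented values, the real computation is emitted by the compiler and usually folded into the x64 addressing mode.

public class RelativeReferenceExample {
    public static void main(String[] args) {
        long heapBase = 0x7f80_0000_0000L; // where the image heap happens to be mapped;
                                           // at run time this value is kept in register r14
        long relativeRef = 0x1040L;        // heap-relative reference stored in an object field
        long absoluteAddress = heapBase + relativeRef; // address used for the memory access
        System.out.printf("0x%x%n", absoluteAddress);
    }
}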
With all object references being relative to the start of the image heap, the image heap that is
prepared during image building and is part of the native image can be memory-mapped multiple
times in the address space. This allows replicating the image heap without copying (fast isolate
creation) and copy-on-write sharing of the image heap (low memory overhead).
The image heap is immortal, i.e., its objects are never freed by the garbage collector (GC), which treats the image heap as a source of root pointers.
However, the GC only needs to scan a small part of the image heap for root pointers: Objects that
are identified as read-only by the points-to analysis can never have references to objects allocated
at run time and therefore contain no root pointers. Similarly, objects that contain only primitive
data, including but not limited to arrays of primitive types, cannot contain root pointers. The image
builder segregates objects in the image heap so that the GC only needs to scan a small, contiguous
part of the image heap. Apart from this grouping of objects, we currently do not perform any layout
optimization of the image heap.
2.6 Isolates
Isolates provide multiple independent VM instances in the same process. Instead of a single large
heap, the developer can split up the heap into smaller pieces. Creating an isolate creates a new
heap, with the image heap as the starting point. This means that all the initialization that is done
during image building is immediately available in every isolate. All isolates share the same AOT
compiled code, i.e., there is no separate points-to analysis and separate compilation per isolate.
Since code is immutable, this sharing is desirable.
Because each isolate has a separate heap, it is not possible to have direct Java object references
between two isolates. This is a restriction and a benefit at the same time: The developer needs
to ensure that the object graph is completely partitioned, and it is, for example, not possible to
have a global cache that is accessible from all isolates. But the isolation allows garbage collection
independently in each isolate, without stopping or influencing other isolates. All memory allocated
by an isolate is freed automatically when the isolate is torn down, without the need for a garbage
collection.
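For illustration, creating and tearing down an isolate could look as follows. This is a minimal sketch assuming the isolate API in the org.graalvm.nativeimage package of the GraalVM SDK; the exact class and method names may differ between SDK versions and should be checked against the documentation.

import org.graalvm.nativeimage.IsolateThread;
import org.graalvm.nativeimage.Isolates;

public class IsolateExample {
    public static void main(String[] args) {
        // Create a new isolate: a fresh heap whose starting point is the
        // copy-on-write mapped image heap, so no initialization code has to run.
        IsolateThread isolateThread =
                Isolates.createIsolate(Isolates.CreateIsolateParameters.getDefault());

        // ... run work in the isolate, e.g., by calling an entry point method ...

        // Tearing down the isolate frees all of its memory at once,
        // without requiring a garbage collection.
        Isolates.tearDownIsolate(isolateThread);
    }
}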
Note that we advocate neither using multiple isolates within one process nor using multiple
processes with a single isolate each. It is up to the application developers to decide which model is
best for them. This flexibility is not offered by most Java VMs such as the Java HotSpot VM, which
only allows one VM instance per process.
2.7 Runtime Components
The native image contains a runtime system for Java, for example, a garbage collector (GC); stack
walking and exception handling; and support for threads and synchronization. All of our runtime
system is written in Java, discovered as reachable by the points-to analysis, and AOT compiled.
There is no distinction between VM components, the application, libraries used by the application,
and the standard Java library. As a result, the VM components also profit from AOT optimizations
based on points-to analysis results, and VM components can, e.g., be inlined into application code.
We use established techniques [Frampton et al. 2009; Wimmer et al. 2013] for writing low-level VM
code in Java.
2.8 Example: Service Loaders
The closed-world assumption of our approach requires that all Java classes, i.e., the application
and all its libraries, are known at image build time. However, this restriction must not impact how
developers structure their application and must not prevent modular programming.
ServiceDefinition.jar
    serviceapi/MyService.class
ServiceImplementation.jar
    serviceimpl/MyServiceImpl.class
    META-INF/services/serviceapi.MyService   (a text file containing the line: serviceimpl.MyServiceImpl)
Fig. 4. Structure of classes and files expected by the Java ServiceLoader.
Java provides a standardized mechanism to load implementation classes of a specified service
interface by declaratively specifying the class name of the service implementation classes. The
implementation class names must be in a text file in the directory META-INF/services, and the file
name must be the name of the service interface. The service interface and the implementations are
usually in different .jar files, and likely implemented by different teams or even different companies.
Figure 4 shows how the service implementation class serviceimpl.MyServiceImpl that im-
plements the service interface serviceapi.MyService is registered. A text file in the direc-
tory META-INF/services named serviceapi.MyService contains one line, specifying the class
name serviceimpl.MyServiceImpl. The service implementation is loaded by the JDK class
ServiceLoader: calling the method load(MyService.class) returns a newly allocated instance
of the class MyServiceImpl.
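For illustration, loading the service from Figure 4 looks as follows (assuming serviceapi.MyService and the implementation jar from Figure 4 are on the class path):

import java.util.ServiceLoader;
import serviceapi.MyService;

public class ServiceLoaderUsage {
    public static void main(String[] args) {
        // ServiceLoader reads META-INF/services/serviceapi.MyService and reflectively
        // instantiates every listed class, here serviceimpl.MyServiceImpl.
        for (MyService service : ServiceLoader.load(MyService.class)) {
            System.out.println("loaded service implementation: " + service.getClass().getName());
        }
    }
}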
Supporting service loaders in our system requires the iterative execution of points-to analysis and
heap snapshotting: Always including all registered service interfaces and implementation classes
in the native image would lead to an excessive amount of unnecessary code in the executable; but
not supporting service loading would remove support for a large class of real-world applications.
Because of the importance of service loaders, the approach described below is part of our standard
distribution. However, it could be implemented by a developer too, i.e., it is not a deeply integrated
part of our analysis but instead implemented as a Feature. Figure 5 shows a simplified version of
the algorithm.
Before the first run of the points-to analysis, nothing is done for service interfaces or service
implementation classes. Assume that the points-to analysis marks the service interface MyService
of our example as being used. After that first round of the points-to analysis, only the service
interface but not the service implementation class is marked as used. Now the Feature code to
support service loaders runs. It iterates over all service interfaces and checks for each interface whether
it is marked as reachable via the API method isReachable(). This method returns true for the
interface MyService. Therefore the file named MyService in the META-INF/services directory
is loaded to retrieve the class names for the service implementations. In our example, this file
contains one line: serviceimpl.MyServiceImpl. This is the name of the service implementation
class. Using our reflection support (see Section 3.2), the class is registered for reflective lookup
(via the API method RuntimeReflection.register()) and the Constructor object for the pa-
rameterless constructor of the class is created and made available at run time (via the API method
RuntimeReflection.registerForReflectiveInstantiation()).
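A sketch of such a Feature, simplified in the spirit of Figure 5, could look as follows. The set of known service interfaces, the helper for reading the service file, and the import paths from the GraalVM SDK are assumptions of this example, not the actual implementation.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

import org.graalvm.nativeimage.hosted.Feature;
import org.graalvm.nativeimage.hosted.RuntimeReflection;

class ServiceLoaderSupportFeature implements Feature {
    // Hypothetical: collected beforehand by scanning the class path for service interfaces.
    private final List<Class<?>> knownServiceInterfaces = new ArrayList<>();

    @Override
    public void beforeAnalysis(BeforeAnalysisAccess access) {
        // not needed for this example
    }

    @Override
    public void duringAnalysis(DuringAnalysisAccess access) {
        for (Class<?> serviceInterface : knownServiceInterfaces) {
            if (!access.isReachable(serviceInterface)) {
                continue; // the interface is not used by any reachable code, skip its implementations
            }
            for (String implName : readServiceFile(serviceInterface)) {
                try {
                    Class<?> impl = Class.forName(implName); // loaded inside the image builder
                    RuntimeReflection.register(impl);
                    RuntimeReflection.registerForReflectiveInstantiation(impl);
                } catch (ClassNotFoundException e) {
                    throw new RuntimeException(e);
                }
            }
        }
    }

    @Override
    public void afterAnalysis(AfterAnalysisAccess access) {
        // not needed for this example
    }

    // Reads META-INF/services/<interface name>: one implementation class name per line.
    private List<String> readServiceFile(Class<?> serviceInterface) {
        List<String> names = new ArrayList<>();
        try {
            Enumeration<URL> files = getClass().getClassLoader()
                    .getResources("META-INF/services/" + serviceInterface.getName());
            while (files.hasMoreElements()) {
                try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                        files.nextElement().openStream(), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        line = line.trim();
                        if (!line.isEmpty() && !line.startsWith("#")) {
                            names.add(line);
                        }
                    }
                }
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return names;
    }
}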
After the Feature.duringAnalysis() method, heap snapshotting is executed. The heap snap-
shotting sees the new Constructor object, which makes a new Constructor.newInstance()
method reachable. That method contains an allocation bytecode for the service implementation
class. The next run of the points-to analysis processes the new method, builds the type-flow graph.
3 IMPLEMENTATION DETAILS
This section presents details that are not novel contributions of the paper, but provide important
background information to understand how the system works in practice and to understand the
evaluation.
3.1 Class Initialization
Java provides for per-class initialization code: Before a class is used the first time (before the class
is instantiated, before a static method is called, or before a static field is accessed the first time),
the Java VM must execute the class initializer. In the Java class file, the class initializer is a method
with the known name <clinit>. In Java source code, the class initializer can be written explicitly
as a static { ... } block, but most classes have only implicit class initialization code: values
directly specified at the declaration site of static and static final fields.
Class initialization code often allocates objects or even large object graphs, and makes them
available via static fields. For most classes, the class initialization code does not depend on values
that change at run time. Therefore, it is desirable to execute class initializers by default at image
build time. The objects allocated by the class initializer are part of the image heap and therefore
available at run time without initialization cost. However, some class initializers depend on run-time
state. For example, a class initializer could query the current username or the current working
directory, which change between image build time and image run time.
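For example (made-up classes, for illustration only), the first class initializer below is independent of run-time state and can safely run at image build time, while the second depends on the environment and is better initialized at run time:

class LookupTables {
    // Pure computation: running this at image build time puts the finished
    // array into the image heap, so no work is needed at application startup.
    static final int[] SQUARES = computeSquares();

    static int[] computeSquares() {
        int[] result = new int[256];
        for (int i = 0; i < result.length; i++) {
            result[i] = i * i;
        }
        return result;
    }
}

class ProcessInfo {
    // Depends on run-time state: if executed at build time, the build machine's
    // working directory would be baked into the image, so this class should be
    // initialized at run time instead.
    static final String WORKING_DIRECTORY = System.getProperty("user.dir");
}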
We use a combination of automatic analysis and manual developer input to determine the class
initialization time: Class initializers that can be proven to not depend on run-time state are always
initialized at image build time. For the remaining classes, the developer can list classes that can be
initialized at build time. All other classes are initialized at run time.
A class marked for initialization at run time is initialized according to the Java specification, i.e.,
immediately before the first usage. The points-to analysis treats such a class initializer like any
other method that can be invoked. For example, it correctly marks static fields written by the
class initializer as "written at run time", even if the field is declared as final.
3.2 Reflection
The Java reflection API provides class lookup by name, as well as access of the methods, constructors,
and fields of a class. Reflection complicates points-to analysis because it is not apparent which
classes, methods, and fields are accessed [Landman et al. 2017]. A conservative analysis that assumes
everything is accessed using reflection would defeat the goals of points-to analysis: we cannot
determine on a fine-grained level which classes, methods, and fields are used by an application if all
of them are possibly accessed using reflection. We require the developer to specify at image build
time which classes, methods, and fields should be visible to reflection. Only the listed elements are
then processed by the points-to analysis. To help collecting reflection information, we provide a
tool similar to [Bodden et al. 2011]: a tracing agent that records reflection usage of the application
either at development or build time.
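As an illustration (with a hypothetical plugin class name), the following reflective lookup gives the points-to analysis no information about which class is instantiated, which is why the accessed class has to be registered at image build time:

public class PluginLoader {
    public static Object loadPlugin(String className) throws ReflectiveOperationException {
        // The class name is only known at run time, so the points-to analysis cannot
        // see which class is instantiated here; the class passed in (e.g. a hypothetical
        // "com.example.MyPlugin") must be registered for reflection at image build time,
        // for example via the configuration recorded by the tracing agent.
        Class<?> clazz = Class.forName(className);
        return clazz.getDeclaredConstructor().newInstance();
    }
}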
The Java Native Interface (JNI) is handled in a similar way: JNI allows C code to look up classes,
methods, and fields by name and then invoke the methods or access the fields. The developer needs
to specify at image build time which elements are visible to JNI. This manual approach allows us to
support reflection and JNI.
3.3 Support for Java 8 Lambdas
Since Java 8, the Java programming language supports lambda expressions. To decouple the language
feature from implementation details, the Java bytecode generated for a lambda expression only provides a high-level specification of the capture, but no actual implementation details. This is
achieved on the Java bytecode level by using the invokedynamic bytecode. Invokedynamic is a
version of a virtual call that does not have a target method fixed in the bytecode. Instead, the target
method is provided by a bootstrap method that is executed once when the invoke is reached the first
time. The bootstrap method for a lambda expression creates a new class, with bytecode assembled
by the bootstrap method using the ASM bytecode generation framework.
Our approach does not support generating new classes at run time, because the points-to analysis
needs to process all classes. Therefore, we force execution of the bootstrap method before the points-
to analysis. Since the bootstrap method is side-effect free apart from the newly produced class, this
does not change the behavior of the Java application. In summary, Java 8 lambda expressions are
fully supported by our approach.
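For illustration, consider the following small lambda expression; the comments describe what happens at the bytecode level (the bootstrap method used by the JDK for lambdas is LambdaMetafactory).

import java.util.function.Supplier;

public class LambdaExample {
    public static void main(String[] args) {
        // javac compiles this lambda to an invokedynamic instruction; the implementation
        // class is generated by the bootstrap method (LambdaMetafactory) the first time
        // the instruction is executed. In our system this generation is forced at image
        // build time so that the points-to analysis sees the generated class.
        Supplier<String> greeting = () -> "hello";
        System.out.println(greeting.get());
    }
}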
Several other parts of Java that normally rely on bytecode generation are handled similarly. For ex-
ample, we force eager bytecode generation for proxy classes required by java.lang.reflect.Proxy.