Initialize Once, Start Fast: Application Initialization at Build Time
1 INTRODUCTION
The recent shift to cloud computing and microservices has changed how server applications are
written and deployed: Instead of using large application servers, independent services are deployed
individually and executed on demand. When the load on a particular service increases, the underly-
ing platform (e.g., AWS Lambda [Amazon Web Services, Inc 2019], IBM Cloud Functions [IBM 2019],
Microsoft Azure Functions [Microsoft Azure 2019], Google Cloud Functions [Google Cloud Platform
2019], or Oracle Functions [Smith 2018]) spawns new workers to process the incoming workload.
The first request processed on the newly spawned worker (cold start) requires initialization of
the language runtime and initialization of the service configuration before the actual request is
processed.
Cold start of VMs can be unpredictable [Barrett et al. 2017] and an order of magnitude slower
than subsequent executions [Akkus et al. 2018], which can break the service-level agreement of
a cloud service. Some language platforms, such as Java VMs and .NET, are particularly slow at
startup because they perform expensive code verification, class loading, bytecode interpretation,
profiling, and dynamic compilation.
In this paper, we explore a novel approach for reducing the startup time and memory footprint:
combining points-to analysis, application initialization at build time, heap snapshotting, and ahead-
of-time (AOT) compilation. This preserves the benefits of the Java ecosystem: combining existing
libraries that are only loosely coupled and extend each other using, e.g., interfaces and declarative
configuration files. Our approach is based on a closed-world assumption, i.e., all Java classes must be
known and available at build time. At build time, the Java application and all its libraries (including
the JDK) are processed by the points-to analysis to find the reachable program elements (classes,
methods, and fields), so that only the necessary methods of Java classes are compiled.
Initialization code of the application can run at build time instead of at run time. The application
can control what is initialized and allocate Java objects to build complex data structures. These
objects are available at run time using a so-called image heap, a pre-initialized part of the heap that
is part of the executable and available immediately at application startup. The points-to analysis
can make objects reachable in the image heap, and the snapshotting that builds the image heap can
make new methods reachable for the points-to analysis. These steps are executed iteratively until a
fixed point is reached.
With our approach, only a few initialization steps are executed at run time before the application’s
main method is invoked, for example, memory-mapping of the image heap. Multiple processes that
run the same executable, or multiple isolated VM instances within one process, use copy-on-write
sharing of the image heap. This reduces the overall memory footprint when multiple instances of
the same application are started.
We combine ideas from several areas of prior research: The concept of an image heap has
been used before by metacircular VMs [Alpern et al. 1999, 2005; Rogers and Grove 2009; Wimmer
et al. 2013]. Heap snapshotting has been used by Smalltalk [Ingalls 1983] and Self [Ungar 1995],
where development and deployment were not distinguishable and both code and data were kept in
snapshots. Points-to analysis [Hind 2001; Ryder 2003; Smaragdakis and Balatsouras 2015; Sridharan
et al. 2013] and AOT compilation are frequently applied and well-studied. The contribution of this
paper is the combination of these ideas: an iterative application of points-to analysis and heap
snapshotting.
We believe that our approach is more suitable for microservices than checkpoint/restore systems,
e.g., CRIU [CRIU 2019], that restore a Java VM such as the Java HotSpot VM: Restoring the Java
HotSpot VM from a checkpoint does not reduce the memory footprint that is systemic due to the
dynamic class loading and dynamic optimization approach, i.e., the memory that the Java HotSpot
VM needs for class metadata, Java bytecode, and dynamically compiled code. In addition, it cannot
rely on a points-to analysis to prune unnecessary parts of the application.
Our implementation, the Native Image component of GraalVM [Oracle 2019a], is written in Java
and Java is used for all examples in this paper, but our approach is not limited to Java or languages
that compile to Java bytecode. It can be applied to all managed languages that are amenable to
points-to analysis, such as C# or other languages of the .NET framework.
In summary, this paper contributes the following:
• We run parts of an application at build time and snapshot the objects allocated by this
initialization code, using an iterative approach that is intertwined with points-to analysis.
• We use points-to analysis results to only AOT-compile the parts of an application that are
reachable at run time.
• We allow multiple processes, and multiple isolated VM instances within one process, to share
the image heap using copy-on-write memory mapping.
• We measure the startup behavior and peak performance, and show that copy-on-write sharing
of the image heap has significant benefits.
2 SYSTEM OVERVIEW
This section presents the system structure and details for all of the build-time and run-time
components of GraalVM Native Image. The input of our system is Java bytecode, compiled from
any language that compiles to Java bytecode such as Java, Scala, or Kotlin. The application, its
libraries, the JDK, and VM components are all processed the same way. The result is a native
executable for a specific operating system and architecture. We call such an executable a native
image. Figure 1 shows the overall build-time system structure. First, points-to analysis and heap
snapshotting are executed iteratively until a fixed point is reached. Callbacks registered by the
application are also executed as part of this iterative approach, i.e., the application can participate.
The result of this stage is a list of reachable classes, methods, and fields; as well as an object graph
of reachable objects. Then, the reachable methods are compiled to machine code, and the object
graph is written out as the image heap in the same layout that is used at run time for the heap. The
machine code is stored in the text section of the native image, and the image heap is stored in the
data section.
We call the entire process that produces a native image the image build time, to clearly distinguish
it from the compilation of Java source code to bytecode (often called "compile time" or "build time").
The image builder is a Java application. The same Java VM that executes the points-to analysis and
AOT compilation also runs the class initializers and the initialization callbacks of the application.
This means that at image build time, objects that later form the image heap are ordinary objects in
the Java heap of the image builder. Heap snapshotting discovers the objects that are reachable for
the image heap.
One of the benefits of our approach is the blurred line between build time and run time: It is
easy to move initializations between build time and run time without large code changes, i.e., it is
possible to write an application that can support both configuration at build time and configuration
at run time. Section 4 presents a concrete case study of such an application.
2.1 Points-to Analysis
We use a points-to analysis [Smaragdakis and Balatsouras 2015; Sridharan et al. 2013] to determine
which classes, methods, and fields are reachable at run time. The points-to analysis starts with
all entry points, e.g., the main method of the application, and iteratively processes all transitively
reachable methods until a fixed point is reached. Our analysis is context insensitive, path insensitive,
and flow insensitive for fields, but flow sensitive for local variables because it is based on SSA
form [Hind 2001; Ryder 2003]. This subsection provides some background on our points-to analysis
implementation. Note that this paper does not contribute a novel points-to analysis approach.
The points-to analysis uses the front end of our compiler to parse Java bytecode into the compiler’s
high-level intermediate representation (IR). The IR is then converted to a so-called type-flow graph.
Nodes in the graph are instructions that operate on object types. Edges are directed use edges
between nodes, i.e., they point from the definition towards the usage. Each node maintains a type
state: a list of types that can reach this node and nullness information. Type states are propagated
through the use edges: if the type state of a node is changed, the change is propagated to all usages.
Type states can only grow, i.e., new types can be added to a type state but types are never removed.
This ensures that the analysis eventually reaches a fixed point and terminates.
The type-flow graph is a single interprocedural graph that covers the whole application. For
static method calls, the caller and callee are linked when the type-flow graph is created: A use
edge connects an actual argument node with the matching formal argument node, i.e., types flow
into methods using argument nodes. The return node of the method has a use edge to the invocation
node in the caller, i.e., types flow out of methods using return nodes. For virtual and interface
method calls, the caller and callees are linked dynamically while the analysis is running: the type
state of the receiver determines which callees are reachable. When a new type is added to the type
state of the receiver, the new type is used to resolve the invocation. This ensures that types are
correctly propagated through virtual method calls, but only necessary callees are linked. For each
type, the points-to analysis tracks if the type is instantiated. Allocation bytecodes mark a type as
instantiated, and are one of the sources of types in the type-flow graph (type-flow nodes without a
predecessor). For each field, the points-to analysis tracks if the field is read or written. Reads and
writes are tracked separately to find fields that are only read but never written at run time. Such
fields are constant-folded later during AOT compilation.
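To make the propagation concrete, the following sketch (with made-up class names, not taken from the paper) shows how type states restrict the callees of a virtual call; the comments describe the information the analysis derives.

interface Shape { double area(); }

class Circle implements Shape {
    double radius;
    public double area() { return Math.PI * radius * radius; }
}

class Square implements Shape {
    double side;
    public double area() { return side * side; }
}

public class TypeFlowExample {
    static double areaOf(Shape shape) {
        // The type state of 'shape' contains only the types that flow into this formal
        // argument node. Here that is {Circle}, so the virtual call below has a single
        // reachable callee and can later be de-virtualized and inlined.
        return shape.area();
    }

    public static void main(String[] args) {
        Shape c = new Circle();   // allocation marks Circle as instantiated (a source node)
        areaOf(c);                // Circle is propagated along the use edge into areaOf()
        // Square is never instantiated, so Square.area() never becomes a reachable callee
        // and is not compiled into the native image.
    }
}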
2.2 Run Initialization Code
When the points-to analysis reaches a local fixed point (no more types added to type states),
initialization code is executed. It comes from two different sources:
First, class initializers are executed. In Java, every class can have a class initializer (sometimes
also called "static initializer") that is represented as a method named <clinit> in the class file. It
computes the initial value of static fields. We allow the developer to decide which classes are
initialized at image build time, and which classes remain uninitialized at image build time and are
then initialized at run time. Section 3.1 provides more information on this policy choice.
Second, the application can register explicit callbacks that are invoked at build time. The applica-
tion can run custom code, e.g., before, during, and after the analysis stage, by implementing hooks
provided by our Feature interface. Figure 2 shows a fragment of this API.
interface Feature {
    // Hooks that are invoked at various stages at image build time.
    void beforeAnalysis(BeforeAnalysisAccess access);
    void duringAnalysis(DuringAnalysisAccess access);
    void afterAnalysis(AfterAnalysisAccess access);
}

interface DuringAnalysisAccess {
    // Access to the current points-to analysis state.
    boolean isReachable(Class clazz);
    boolean isReachable(Field field);
    boolean isReachable(Executable method);
}
Fig. 2. Fragment of the API to run application code at build time.
The relevant hook to run code before heap snapshotting is the "during analysis" hook: it is exe-
cuted each time the points-to analysis has reached a local fixed point, but before the corresponding
heap snapshotting. At this time, the application can run custom code that, e.g., allocates objects
and initializes larger data structures. The important point to notice is that the initialization code
can access the current points-to analysis state: it can query if a type, method, or field was already
marked as reachable by the points-to analysis using the various isReachable() methods provided
by DuringAnalysisAccess. The application can use this information to build data structures that
are optimized for exactly the reachable parts of the application. Section 2.8 and the case study in
Section 4 present examples for running initializations at build time. These examples use the API
shown in Figure 2. The complete API is documented on the GraalVM homepage [Oracle 2019a].
2.3 Heap Snapshotting
Heap snapshotting builds an object graph, i.e., the transitive closure of reachable objects starting
with root pointers such as static fields. This object graph is written into the native image as the
image heap, i.e., the initial heap when the native image is started. Figure 3 shows the core algorithm.
The input of the algorithm is the state of the points-to analysis, pointsToState.
The root pointers of the image heap are static object fields that are marked by the points-to
analysis as read, as well as values that were constant folded into methods such as our image
singletons (see Section 3.5). These root values are first added to a worklist, and then the worklist is
processed until it is empty. Every object is added to the worklist and processed only once; for this
purpose, a set of all reachable objects is maintained.
To build the transitive closure, object fields are followed by reading field values using reflection
(remember that the image builder is a Java application). Only instance fields that are marked as
read by the points-to analysis are considered, i.e., if a class has two instance fields but one of them
is not marked as read by the points-to analysis, the object reachable from that field is not part of
the image heap.
If the class of the field’s value was not yet seen as a possible type of the field by the points-to
analysis, the class is registered as a field type. In such a case, the next run of the points-to analysis
propagates this new type to all reads of the field, and to all transitive usages in the type-flow graph.
Object arrays are handled in a similar way. In contrast to fields, the points-to analysis only
maintains a single set of types for object arrays (the Java type system does not properly distinguish
between arrays of different types). Types added to this set are propagated by the points-to analysis
through the type-flow graph.
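As a rough illustration of the worklist traversal described above, the following sketch walks the object graph with reflection and reports newly seen field types back to the analysis. The PointsToState interface and all names are hypothetical stand-ins, not the actual implementation from Figure 3, and error handling is simplified.

import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

interface PointsToState {
    List<Object> rootValues();                          // static fields marked as read, embedded constants
    boolean isRead(Field field);                        // instance field marked as read by the analysis
    void registerFieldType(Field field, Class<?> type); // feeds a newly seen field type back to the analysis
}

final class HeapSnapshotting {
    static Set<Object> snapshot(PointsToState pointsToState) throws IllegalAccessException {
        Set<Object> reachable = Collections.newSetFromMap(new IdentityHashMap<>());
        Deque<Object> worklist = new ArrayDeque<>(pointsToState.rootValues());
        while (!worklist.isEmpty()) {
            Object obj = worklist.pop();
            if (!reachable.add(obj)) {
                continue; // every object is processed only once
            }
            if (obj instanceof Object[]) {
                for (Object element : (Object[]) obj) {  // object arrays: follow all elements
                    if (element != null) {
                        worklist.push(element);
                    }
                }
            } else {
                for (Class<?> c = obj.getClass(); c != null; c = c.getSuperclass()) {
                    for (Field field : c.getDeclaredFields()) {
                        if (Modifier.isStatic(field.getModifiers()) || field.getType().isPrimitive()) {
                            continue;
                        }
                        if (!pointsToState.isRead(field)) {
                            continue; // fields not marked as read are not followed
                        }
                        field.setAccessible(true);
                        Object value = field.get(obj);   // the image builder is a Java application
                        if (value != null) {
                            // May add a type the analysis has not yet seen for this field,
                            // which triggers another round of the points-to analysis.
                            pointsToState.registerFieldType(field, value.getClass());
                            worklist.push(value);
                        }
                    }
                }
            }
        }
        return reachable;
    }
}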
2.4 Ahead-of-Time (AOT) Compilation
Only methods that are marked as reachable by the points-to analysis are AOT compiled to machine
code and placed in the text section of the executable. We use the GraalVM compiler, an established
compiler that compiles Java bytecode to machine code. The compiler is modular and can be
configured for different VMs and compilation scenarios such as dynamic compilation and AOT
compilation. However, the approach proposed in this paper is mostly independent from the compiler,
and adapting our compiler to use points-to analysis results for AOT compilation instead of profiling
information for dynamic compilation was not a significant amount of work.
The GraalVM compiler performs all standard optimizations such as method inlining, constant
folding, arithmetic optimizations, and loop optimizations, as well as optimizations that are especially
effective for object-oriented Java code such as partial escape analysis [Stadler et al. 2014]. Speculative
compiler optimizations that require deoptimization are disabled for the AOT compilation. Instead,
the compiler incorporates points-to analysis results to improve the code quality: Fields that are
not marked as written at run time are constant folded, regardless of whether they are declared as
final or not. Virtual and interface calls for which the points-to analysis found only a single callee
are de-virtualized to a direct call, which also enables inlining of the callee. Type checks where the
type state of the checked value only contains types that extend the checked type (or only types
that do not extend the checked type) are replaced with a constant result. Similarly, null checks are
removed when the points-to analysis marks a type state as never null.
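As an illustration with made-up names (not from the paper), consider a static field that is set by a class initializer at image build time and never written at run time:

class Flags {
    // Set once by the class initializer (executed at image build time, see Section 3.1);
    // the points-to analysis finds no write to this field in any code reachable at run time.
    static boolean assertionsEnabled = computeAssertionStatus();

    static boolean computeAssertionStatus() {
        return Flags.class.desiredAssertionStatus();
    }
}

class RequestHandler {
    void handle(String request) {
        // 'assertionsEnabled' is not declared final, but because it is never written at
        // run time its build-time value is constant-folded and the dead branch is removed.
        if (Flags.assertionsEnabled) {
            System.out.println("handling " + request);
        }
    }
}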
Code that runs at image build time is often not reachable at run time and therefore not compiled.
For example, class initializers that are executed at image build time are not compiled (see Section 3.1).
However, it is allowed to have code that runs both at image build time and at run time, e.g., shared
utility methods.
2.5 Image Heap
Execution at run time starts with an already pre-populated Java heap that is constructed by the
heap snapshotting algorithm during image building: the image heap. Starting up the executable, or
creating a new isolated VM instance (called isolate; see Section 2.6) in an existing process, must be
fast and have a low memory overhead, so that many VM instances can be created for short-running
tasks. Therefore, there must not be an expensive relocation step when loading the image heap.
For isolates, it must be possible to map the image heap at different memory addresses in the same
address space.
To achieve these goals, we use references that are relative to the start of the image heap. This
means that memory accesses, like loading a field from an object or an array element, need more
address arithmetic than the usual absolute references: the heap start must be added to the reference
before the memory access. Objects of the image heap and objects allocated at run time use the same
format, i.e., also objects allocated at run time use relative references. To make this as fast as possible,
the heap start is always available in a fixed register (we use the register r14 on x64 architectures).
Note that in many cases the addition can be folded into the memory access instruction on the x64
architecture, avoiding an explicit arithmetic operation.
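The address arithmetic can be illustrated as follows; this is a conceptual sketch with invented values, the real computation is emitted by the compiler and usually folded into the x64 addressing mode.

public class RelativeReferenceExample {
    public static void main(String[] args) {
        long heapBase = 0x7f80_0000_0000L; // where the image heap happens to be mapped;
                                           // at run time this value is kept in register r14
        long relativeRef = 0x1040L;        // heap-relative reference stored in an object field
        long absoluteAddress = heapBase + relativeRef; // address used for the memory access
        System.out.printf("0x%x%n", absoluteAddress);
    }
}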
With all object references being relative to the start of the image heap, the image heap that is
prepared during image building and is part of the native image can be memory-mapped multiple
times in the address space. This allows replicating the image heap without copying (fast isolate
creation) and copy-on-write sharing of the image heap (low memory overhead).
The image heap is immortal, i.e., its objects are never freed by the garbage collector (GC), which treats the image heap as a source of root pointers.
However, the GC only needs to scan a small part of the image heap for root pointers: Objects that
are identified as read-only by the points-to analysis can never have references to objects allocated
at run time and therefore contain no root pointers. Similarly, objects that contain only primitive
data, including but not limited to arrays of primitive types, cannot contain root pointers. The image
builder segregates objects in the image heap so that the GC only needs to scan a small, contiguous
part of the image heap. Apart from this grouping of objects, we currently do not perform any layout
optimization of the image heap.
2.6 Isolates
Isolates provide multiple independent VM instances in the same process. Instead of a single large
heap, the developer can split up the heap into smaller pieces. Creating an isolate creates a new
heap, with the image heap as the starting point. This means that all the initialization that is done
during image building is immediately available in every isolate. All isolates share the same AOT
compiled code, i.e., there is no separate points-to analysis and separate compilation per isolate.
Since code is immutable, this sharing is desirable.
Because each isolate has a separate heap, it is not possible to have direct Java object references
between two isolates. This is a restriction and a benefit at the same time: The developer needs
to ensure that the object graph is completely partitioned, and it is, for example, not possible to
have a global cache that is accessible from all isolates. But the isolation allows garbage collection
independently in each isolate, without stopping or influencing other isolates. All memory allocated
by an isolate is freed automatically when the isolate is torn down, without the need for a garbage
collection.
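For illustration, creating and tearing down an isolate could look as follows. This is a minimal sketch assuming the isolate API in the org.graalvm.nativeimage package of the GraalVM SDK; the exact class and method names may differ between SDK versions and should be checked against the documentation.

import org.graalvm.nativeimage.IsolateThread;
import org.graalvm.nativeimage.Isolates;

public class IsolateExample {
    public static void main(String[] args) {
        // Create a new isolate: a fresh heap whose starting point is the
        // copy-on-write mapped image heap, so no initialization code has to run.
        IsolateThread isolateThread =
                Isolates.createIsolate(Isolates.CreateIsolateParameters.getDefault());

        // ... run work in the isolate, e.g., by calling an entry point method ...

        // Tearing down the isolate frees all of its memory at once,
        // without requiring a garbage collection.
        Isolates.tearDownIsolate(isolateThread);
    }
}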
Note that we advocate neither using multiple isolates within one process nor using multiple
processes with a single isolate each. It is up to the application developers to decide which model is
best for them. This flexibility is not offered by most Java VMs such as the Java HotSpot VM, which
only allows one VM instance per process.
2.7 Runtime Components
The native image contains a runtime system for Java, for example, a garbage collector (GC); stack
walking and exception handling; and support for threads and synchronization. All of our runtime
system is written in Java, discovered as reachable by the points-to analysis, and AOT compiled.
There is no distinction between VM components, the application, libraries used by the application,
and the standard Java library. As a result, the VM components also profit from AOT optimizations
based on points-to analysis results, and VM components can, e.g., be inlined into application code.
We use established techniques [Frampton et al. 2009; Wimmer et al. 2013] for writing low-level VM
code in Java.
2.8 Example: Service Loaders
The closed-world assumption of our approach requires that all Java classes, i.e., the application
and all its libraries, are known at image build time. However, this restriction must not impact how
developers structure their application and must not prevent modular programming.
ServiceDefinition.jar
    serviceapi/MyService.class
ServiceImplementation.jar
    serviceimpl/MyServiceImpl.class
    META-INF/services/serviceapi.MyService   (a text file containing the line: serviceimpl.MyServiceImpl)
Fig. 4. Structure of classes and files expected by the Java ServiceLoader.
Java provides a standardized mechanism to load implementation classes of a specified service
interface by declaratively specifying the class name of the service implementation classes. The
implementation class names must be in a text file in the directory META-INF/services, and the file
name must be the name of the service interface. The service interface and the implementations are
usually in different .jar files, and likely implemented by different teams or even different companies.
Figure 4 shows how the service implementation class serviceimpl.MyServiceImpl that im-
plements the service interface serviceapi.MyService is registered. A text file in the direc-
tory META-INF/services named serviceapi.MyService contains one line, specifying the class
name serviceimpl.MyServiceImpl. The service implementation is loaded by the JDK class
ServiceLoader: calling the method load(MyService.class) returns a newly allocated instance
of the class MyServiceImpl.
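For illustration, loading the service from Figure 4 looks as follows (assuming serviceapi.MyService and the implementation jar from Figure 4 are on the class path):

import java.util.ServiceLoader;
import serviceapi.MyService;

public class ServiceLoaderUsage {
    public static void main(String[] args) {
        // ServiceLoader reads META-INF/services/serviceapi.MyService and reflectively
        // instantiates every listed class, here serviceimpl.MyServiceImpl.
        for (MyService service : ServiceLoader.load(MyService.class)) {
            System.out.println("loaded service implementation: " + service.getClass().getName());
        }
    }
}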
Supporting service loaders in our system requires the iterative execution of points-to analysis and
heap snapshotting: Always including all registered service interfaces and implementation classes
in the native image would lead to an excessive amount of unnecessary code in the executable; but
not supporting service loading would remove support for a large class of real-world applications.
Because of the importance of service loaders, the approach described below is part of our standard
distribution. However, it could be implemented by a developer too, i.e., it is not a deeply integrated
part of our analysis but instead implemented as a Feature. Figure 5 shows a simplified version of
the algorithm.
Before the first run of the points-to analysis, nothing is done for service interfaces or service
implementation classes. Assume that the points-to analysis marks the service interface MyService
of our example as being used. After that first round of the points-to analysis, only the service
interface but not the service implementation class is marked as used. Now the Feature code to
support service loaders runs. It iterates over all service interfaces and checks for each interface whether
it is marked as reachable via the API method isReachable(). This method returns true for the
interface MyService. Therefore the file named MyService in the META-INF/services directory
is loaded to retrieve the class names for the service implementations. In our example, this file
contains one line: serviceimpl.MyServiceImpl. This is the name of the service implementation
class. Using our reflection support (see Section 3.2), the class is registered for reflective lookup
(via the API method RuntimeReflection.register()) and the Constructor object for the pa-
rameterless constructor of the class is created and made available at run time (via the API method
RuntimeReflection.registerForReflectiveInstantiation()).
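A sketch of such a Feature, simplified in the spirit of Figure 5, could look as follows. The set of known service interfaces, the helper for reading the service file, and the import paths from the GraalVM SDK are assumptions of this example, not the actual implementation.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

import org.graalvm.nativeimage.hosted.Feature;
import org.graalvm.nativeimage.hosted.RuntimeReflection;

class ServiceLoaderSupportFeature implements Feature {
    // Hypothetical: collected beforehand by scanning the class path for service interfaces.
    private final List<Class<?>> knownServiceInterfaces = new ArrayList<>();

    @Override
    public void beforeAnalysis(BeforeAnalysisAccess access) {
        // not needed for this example
    }

    @Override
    public void duringAnalysis(DuringAnalysisAccess access) {
        for (Class<?> serviceInterface : knownServiceInterfaces) {
            if (!access.isReachable(serviceInterface)) {
                continue; // the interface is not used by any reachable code, skip its implementations
            }
            for (String implName : readServiceFile(serviceInterface)) {
                try {
                    Class<?> impl = Class.forName(implName); // loaded inside the image builder
                    RuntimeReflection.register(impl);
                    RuntimeReflection.registerForReflectiveInstantiation(impl);
                } catch (ClassNotFoundException e) {
                    throw new RuntimeException(e);
                }
            }
        }
    }

    @Override
    public void afterAnalysis(AfterAnalysisAccess access) {
        // not needed for this example
    }

    // Reads META-INF/services/<interface name>: one implementation class name per line.
    private List<String> readServiceFile(Class<?> serviceInterface) {
        List<String> names = new ArrayList<>();
        try {
            Enumeration<URL> files = getClass().getClassLoader()
                    .getResources("META-INF/services/" + serviceInterface.getName());
            while (files.hasMoreElements()) {
                try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                        files.nextElement().openStream(), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        line = line.trim();
                        if (!line.isEmpty() && !line.startsWith("#")) {
                            names.add(line);
                        }
                    }
                }
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return names;
    }
}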
After the Feature.duringAnalysis() method, heap snapshotting is executed. The heap snap-
shotting sees the new Constructor object, which makes a new Constructor.newInstance()
method reachable. That method contains an allocation bytecode for the service implementation
class. The next run of the points-to analysis processes the new method, builds the type-flow graph.
3 IMPLEMENTATION DETAILS
This section presents details that are not novel contributions of the paper, but provide important
background information to understand how the system works in practice and to understand the
evaluation.
3.1 Class Initialization
Java provides for per-class initialization code: Before a class is used the first time (before the class
is instantiated, before a static method is called, or before a static field is accessed the first time),
the Java VM must execute the class initializer. In the Java class file, the class initializer is a method
with the known name <clinit>. In Java source code, the class initializer can be written explicitly
as a static { ... } block, but most classes have only implicit class initialization code: values
directly specified at the declaration site of static and static final fields.
Class initialization code often allocates objects or even large object graphs, and makes them
available via static fields. For most classes, the class initialization code does not depend on values
that change at run time. Therefore, it is desirable to execute class initializers by default at image
build time. The objects allocated by the class initializer are part of the image heap and therefore
available at run time without initialization cost. However, some class initializers depend on run-time
state. For example, a class initializer could query the current username or the current working
directory, which change between image build time and image run time.
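For example (made-up classes, for illustration only), the first class initializer below is independent of run-time state and can safely run at image build time, while the second depends on the environment and is better initialized at run time:

class LookupTables {
    // Pure computation: running this at image build time puts the finished
    // array into the image heap, so no work is needed at application startup.
    static final int[] SQUARES = computeSquares();

    static int[] computeSquares() {
        int[] result = new int[256];
        for (int i = 0; i < result.length; i++) {
            result[i] = i * i;
        }
        return result;
    }
}

class ProcessInfo {
    // Depends on run-time state: if executed at build time, the build machine's
    // working directory would be baked into the image, so this class should be
    // initialized at run time instead.
    static final String WORKING_DIRECTORY = System.getProperty("user.dir");
}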
We use a combination of automatic analysis and manual developer input to determine the class
initialization time: Class initializers that can be proven to not depend on run-time state are always
initialized at image build time. For the remaining classes, the developer can list classes that can be
initialized at build time. All other classes are initialized at run time.
A class marked for initialization at run time is initialized according to the Java specification, i.e.,
immediately before the first usage. The points-to analysis treats such a class initializer like any
other method that can be invoked. For example, it correctly marks static fields written by the
class initializer as "written at run time", even if the field is declared as final.
3.2 Reflection
The Java reflection API provides class lookup by name, as well as access of the methods, constructors,
and fields of a class. Reflection complicates points-to analysis because it is not apparent which
classes, methods, and fields are accessed [Landman et al. 2017]. A conservative analysis that assumes
everything is accessed using reflection would defeat the goals of points-to analysis: we cannot
determine on a fine-grained level which classes, methods, and fields are used by an application if all
of them are possibly accessed using reflection. We require the developer to specify at image build
time which classes, methods, and fields should be visible to reflection. Only the listed elements are
then processed by the points-to analysis. To help collecting reflection information, we provide a
tool similar to [Bodden et al. 2011]: a tracing agent that records reflection usage of the application
either at development or build time.
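As an illustration (with a hypothetical plugin class name), the following reflective lookup gives the points-to analysis no information about which class is instantiated, which is why the accessed class has to be registered at image build time:

public class PluginLoader {
    public static Object loadPlugin(String className) throws ReflectiveOperationException {
        // The class name is only known at run time, so the points-to analysis cannot
        // see which class is instantiated here; the class passed in (e.g. a hypothetical
        // "com.example.MyPlugin") must be registered for reflection at image build time,
        // for example via the configuration recorded by the tracing agent.
        Class<?> clazz = Class.forName(className);
        return clazz.getDeclaredConstructor().newInstance();
    }
}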
The Java Native Interface (JNI) is handled in a similar way: JNI allows C code to look up classes,
methods, and fields by name and then invoke the methods or access the fields. The developer needs
to specify at image build time which elements are visible to JNI. This manual approach allows us to
support reflection and JNI.
3.3 Support for Java 8 Lambdas
Since Java 8, the Java programming language supports lambda expressions. To decouple the language
feature from implementation details, the Java bytecode generated for a lambda expression only provides a high-level specification of the capture, but no actual implementation details. This is
achieved on the Java bytecode level by using the invokedynamic bytecode. Invokedynamic is a
version of a virtual call that does not have a target method fixed in the bytecode. Instead, the target
method is provided by a bootstrap method that is executed once when the invoke is reached the first
time. The bootstrap method for a lambda expression creates a new class, with bytecode assembled
by the bootstrap method using the ASM bytecode generation framework.
Our approach does not support generating new classes at run time, because the points-to analysis
needs to process all classes. Therefore, we force execution of the bootstrap method before the points-
to analysis. Since the bootstrap method is side-effect free apart from the newly produced class, this
does not change the behavior of the Java application. In summary, Java 8 lambda expressions are
fully supported by our approach.
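For illustration, consider the following small lambda expression; the comments describe what happens at the bytecode level (the bootstrap method used by the JDK for lambdas is LambdaMetafactory).

import java.util.function.Supplier;

public class LambdaExample {
    public static void main(String[] args) {
        // javac compiles this lambda to an invokedynamic instruction; the implementation
        // class is generated by the bootstrap method (LambdaMetafactory) the first time
        // the instruction is executed. In our system this generation is forced at image
        // build time so that the points-to analysis sees the generated class.
        Supplier<String> greeting = () -> "hello";
        System.out.println(greeting.get());
    }
}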
Several other parts of Java that normally rely on bytecode generation are handled similarly. For ex-
ample, we force eager bytecode generation for proxy classes required by java.lang.reflect.Proxy.