<p>Jan de Mooij (feed: https://jandemooij.nl/feed.xml)</p>
<h1><a href="https://jandemooij.nl/blog/some-spidermonkey-optimizations-in-firefox-quantum/">Some SpiderMonkey optimizations in Firefox Quantum</a></h1>
<p>Published 2017-12-06</p>
<p>A few weeks ago we released Firefox 57, also known as Firefox Quantum. I work on SpiderMonkey performance and this year I spent a lot of time analyzing profiles and optimizing as many things as possible.</p>
<p>I can't go through all SpiderMonkey performance improvements here: this year alone I landed almost 500 patches (most of them performance related, most of them pretty boring) and other SpiderMonkey hackers have done excellent work in this area as well. So I just decided to focus on some of the more interesting optimizations I worked on in 2017. Most of these changes landed in Firefox 55-57, some of them will ship in Firefox 58 or 59.</p>
<h2 id="shift-unshift-optimizations">shift/unshift optimizations</h2>
<p><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/shift">Array.prototype.shift</a> removes the first element from an array. We used to implement this by moving all other elements in memory, so shift() had O(n) performance. When JS code used arrays as queues, it was easy to get quadratic behavior (outlook.com <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1348772#c2">did something like this</a>):</p>
<pre data-lang="js" style="background-color:#282c34;color:#abb2bf;" class="language-js "><code class="language-js" data-lang="js"><span style="color:#c678dd;">while </span><span>(</span><span style="color:#e06c75;">arr</span><span>.length > </span><span style="color:#d19a66;">0</span><span>)
</span><span> </span><span style="color:#61afef;">foo</span><span>(</span><span style="color:#e06c75;">arr</span><span>.</span><span style="color:#56b6c2;">shift</span><span>());
</span></code></pre>
<p>The situation was <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1362956">even worse</a> for us when we were in the middle of an incremental GC (due to pre-barriers).</p>
<p>Instead of moving the elements in memory, we <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1348772">can now use</a> pointer arithmetic to make the object point to the second element (with some bookkeeping for the GC) and this is an order of magnitude faster when there are many elements. I also <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1364346">optimized unshift</a> and <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1364345">splice</a> to take advantage of this shifted-elements optimization. For instance, unshift can now reserve space for extra elements at the start of the array, so subsequent calls to unshift will be very fast. </p>
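<p>The idea can be mimicked in plain JS (a sketch of the concept only; SpiderMonkey's real version does pointer arithmetic on the elements buffer in C++, and the <code>ShiftedQueue</code> name and compaction threshold below are made up):</p>

```javascript
// Dequeue by advancing a start index instead of moving every element,
// which makes each shift O(1) amortized instead of O(n).
class ShiftedQueue {
  constructor() {
    this.items = [];
    this.start = 0; // index of the logical first element
  }
  push(v) {
    this.items.push(v);
  }
  shift() {
    if (this.start === this.items.length) return undefined;
    const v = this.items[this.start++];
    // Compact occasionally so the consumed prefix can be reclaimed.
    if (this.start > 1024 && this.start * 2 > this.items.length) {
      this.items = this.items.slice(this.start);
      this.start = 0;
    }
    return v;
  }
  get length() {
    return this.items.length - this.start;
  }
}

const q = new ShiftedQueue();
for (let i = 1; i <= 5; i++) q.push(i);
console.log(q.shift(), q.shift(), q.length); // 1 2 3
```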
<p>While working on this I noticed some other engines have a similar optimization, but it doesn't always work (for example when the array has many elements). In SpiderMonkey, the shifted-elements optimization fits in very well architecturally and we don't have such performance cliffs (as far as I know!).</p>
<h2 id="regular-expressions">Regular expressions</h2>
<p>RegExp objects <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1368461">can now be nursery allocated</a>. We can now also <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1115355">allocate them directly from JIT code</a>. These changes improved some benchmarks a lot (the orange line is Firefox):</p>
<p><img src="/img/regexp1.png" alt="" /></p>
<p>While working on this, I also <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1378740">moved the RegExpShared table</a> from the compartment to the zone: when multiple iframes use the same regular expression, we will now only parse and compile it once (this matters for ads, Facebook like buttons, etc). I also <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1386199">fixed</a> a performance bug with regular expressions and interrupts: we would sometimes execute regular expressions in the (slow!) regex interpreter instead of running the (much faster) regex JIT code.</p>
<p>Finally, the RegExp constructor could waste a lot of time checking the pattern syntax. I noticed this when I was profiling real-world code, but the <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1419785">fix for this</a> also happened to double our Dromaeo object-regexp score :)</p>
<p><img src="/img/regexp2.png" alt="" /></p>
<h2 id="inline-caches">Inline Caches</h2>
<p>This year we finished converting our most important ICs (for getting/setting properties) to <a href="/blog/2017/01/25/cacheir/">CacheIR</a>, our new IC architecture. This allowed us to optimize more things; here are a few:</p>
<ul>
<li>I <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1328140">rewrote</a> our IC heuristics. We now have special stubs for megamorphic lookups.</li>
<li>We now optimize more <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=965992">property gets</a> and <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1133423">sets</a> on DOM proxies like <code>document</code> or <code>NodeLists</code>. We had some longstanding bugs here that were much easier to fix with our new tools.</li>
<li>Similarly, we <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1332593">now</a> optimize <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1340496">more</a> property accesses on <code>WindowProxy</code> (things like <code>this.foo</code> or <code>window.foo</code> in the browser).</li>
<li>Our IC stubs for <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1091978">adding slots</a> and <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1344691">adding elements</a> now support (re)allocating new slots/elements.</li>
<li>We can <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1323099">now use</a> ICs in <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1350896">more cases</a>.</li>
</ul>
<p>The work on CacheIR has really paid off this year: we were able to remove many lines of code while also improving IC coverage and performance a lot.</p>
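<p>To illustrate what "megamorphic" means in the heuristics above (an illustrative sketch; the exact thresholds SpiderMonkey uses are internal): a property access site that sees many different object layouts can no longer be handled with one stub per layout, so it falls back to a generic megamorphic lookup stub.</p>

```javascript
// getX is one IC site. Each distinct object layout (Shape) would normally
// get its own stub; after enough different Shapes the site becomes
// megamorphic and a generic lookup stub is used instead.
function getX(o) {
  return o.x;
}

const objects = [];
for (let i = 0; i < 20; i++) {
  const o = { x: i };
  o["extra" + i] = true; // give every object a different layout
  objects.push(o);
}

let sum = 0;
for (const o of objects) sum += getX(o);
console.log(sum); // 190
```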
<h2 id="property-addition">Property addition</h2>
<p>Adding new properties to objects used to be much slower than necessary. I <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1346217">landed</a> at <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1372182">least</a> 20 <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1394365">patches</a> to optimize this.</p>
<p>SpiderMonkey used to support slotful accessor properties (data properties with a getter/setter) and this complicated our object layout a lot. To get rid of this, I first had to <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1389510">remove</a> the internal getProperty and setProperty Class hooks; this turned out to be pretty complicated because I had to fix some ancient code in Gecko that relied on these hooks, from <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1389949">NPAPI code</a> to <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1389776">js-ctypes</a> to <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1390159">XPConnect</a>.</p>
<p>After that I was able to <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1153592">remove</a> slotful accessor properties and <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1404310">simplify</a> a lot of code. This allowed us to <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1394831">optimize</a> our property addition/lookup code even more: for instance, we now have separate methods for adding data vs accessor properties. This was impossible to do before because there was simply no clear distinction between data and accessor properties.</p>
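<p>For reference, the distinction in JS terms: a data property stores a value in a slot, while an accessor property stores getter/setter functions that run on each access.</p>

```javascript
// A data property holds a value directly; an accessor property holds a
// getter (and optionally a setter) invoked on every access.
const obj = {};
Object.defineProperty(obj, "data", { value: 42, writable: true });
Object.defineProperty(obj, "accessor", {
  get() {
    return this.data * 2;
  },
});
console.log(obj.data);     // 42
console.log(obj.accessor); // 84
```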
<h2 id="property-iteration">Property iteration</h2>
<p>Property iteration via for-in or Object.keys is pretty common, so I spent some time optimizing this. We used to have some caches that were fine for micro-benchmarks (read: SunSpider), but didn't work very well on real-world code. I <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1375505">optimized the for-in code</a>, rewrote the iterator cache, and added an IC for this. For-in performance should be much better now. </p>
<p>I also <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1373615">rewrote the enumeration code</a> used by for-in, Object.getOwnPropertyNames, and similar operations to be much faster and simpler.</p>
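<p>These are the enumeration entry points in question:</p>

```javascript
// The same enumeration machinery backs for-in, Object.keys, and
// Object.getOwnPropertyNames; for-in additionally walks the prototype
// chain for enumerable inherited properties.
const point = { x: 1, y: 2 };

const forInKeys = [];
for (const k in point) forInKeys.push(k);

console.log(forInKeys);                         // [ 'x', 'y' ]
console.log(Object.keys(point));                // [ 'x', 'y' ]
console.log(Object.getOwnPropertyNames(point)); // [ 'x', 'y' ]
```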
<h2 id="minorgc-triggers">MinorGC triggers</h2>
<p>In Firefox, when navigating to another page, we have a mechanism to "nuke" chrome -> content wrappers <a href="https://blog.mozilla.org/nnethercote/2012/05/07/update-on-leaky-add-ons/">to prevent bad memory leaks</a>. The code for this used to trigger a minor GC to evict the GC's nursery, in case there were nursery-allocated wrappers. These GCs showed up in profiles and it turned out that most of these evict-nursery calls were unnecessary, so I <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1370823">fixed this</a>.</p>
<p><a href="https://groups.google.com/d/msg/mozilla.dev.telemetry-alerts/An9XoHhqoYM/AQ95iqRzCAAJ">According to Telemetry</a>, this small patch eliminated <em>tons</em> of unnecessary minor GCs in the browser:</p>
<p><img src="/img/gcreason1.png" alt="" /></p>
<p>The black line shows that most of our minor GCs (69%) were EVICT_NURSERY GCs; afterwards (the orange line) this reason barely shows up anymore. We now see other minor GC reasons that are more common and expected (full nursery, full store buffer, etc.).</p>
<h2 id="proxies">Proxies</h2>
<p>After <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1339411">refactoring</a> our object allocation code to be faster and simpler, it was easy to optimize proxy objects: we now <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1358753">allocate ProxyValueArray inline</a> instead of requiring a malloc for each proxy.</p>
<p>Proxies can <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1237504">also have an arbitrary slot layout</a> now (a longstanding request from our DOM team). Accessing certain slots on DOM objects is now faster than before and I was able to <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1360523">shrink many of</a> our proxies (before these changes, all proxy objects had at least 3-4 Value slots, even though most proxies need only 1 or 2 slots).</p>
<h2 id="builtins">Builtins</h2>
<p>A lot of builtin functions were optimized. Here are just a few of them:</p>
<ul>
<li>I <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1344173">fixed some</a> bad <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1344463">performance cliffs</a> that affected various Array functions.</li>
<li>I <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1364854">ported Object.assign to C++</a>. It now uses less memory (we used to allocate an array for the property names) and in a lot of cases is much faster than before.</li>
<li>I <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1383343">optimized</a> Function.prototype.toString. It's surprisingly common for websites to stringify the same function repeatedly so we now have a FunctionToString cache for this.</li>
<li>Object.prototype.toString is very hot and I <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1353679">optimized</a> it a <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1376799">number</a> of times. We can also <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1385215">inline</a> it in the JIT now and I added a <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1369042">new optimization</a> for lookups of the toStringTag/toPrimitive Symbols.</li>
<li>Array.isArray is now inlined in JIT code in <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1376691">a lot more</a> cases.</li>
</ul>
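<p>A quick illustration of why the <code>Symbol.toStringTag</code> lookup matters for <code>Object.prototype.toString</code>: when the symbol is present it changes the result, so a fast path has to guard on its absence.</p>

```javascript
// Without a Symbol.toStringTag, the result comes from the object's
// built-in class; with one, the tag string is used instead.
console.log(Object.prototype.toString.call([]));     // [object Array]
console.log(Object.prototype.toString.call(null));   // [object Null]

const tagged = { [Symbol.toStringTag]: "Custom" };
console.log(Object.prototype.toString.call(tagged)); // [object Custom]

console.log(Array.isArray([1, 2]), Array.isArray("x")); // true false
```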
<h2 id="other-optimizations">Other optimizations</h2>
<ul>
<li>We used to unnecessarily delazify (trigger full parsing of) <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1357711">thousands of functions</a> when loading Gmail.</li>
<li>Babel generates code that mutates <code>__proto__</code> and this used to deoptimize a lot of things. I <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1357680">fixed a number of issues</a> in this area.</li>
<li>Cycle detection (for instance for JSON.stringify and Array.prototype.join) <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1342345">now uses a Vector</a> instead of a HashSet. This is much faster in the common cases (and not that much slower in pathological cases).</li>
<li>I devirtualized some of our hottest virtual functions <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1359421">in the frontend</a> and <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1391611">in our</a> Ion JIT <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1392530">backend</a>.</li>
</ul>
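<p>The cycle detection mentioned above is what makes the following throw instead of looping forever; with the shallow nesting seen in practice, a linear scan over a small vector of active objects beats a hash set:</p>

```javascript
// JSON.stringify must detect when a value (indirectly) contains itself;
// the engine tracks the stack of objects being serialized and checks
// each new object against it.
const a = { name: "a" };
a.self = a; // create a cycle

let threw = false;
try {
  JSON.stringify(a);
} catch (e) {
  threw = true; // TypeError: cyclic object value
}
console.log(threw); // true
```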
<h2 id="conclusion">Conclusion</h2>
<p>SpiderMonkey performance has improved tremendously over the past months, and we're not stopping here. Hopefully there will be a lot more of this in 2018 :) If you find some real-world JS code that's much slower in Firefox than in other browsers, please let us know. Usually when we're significantly slower than other browsers it's because we're doing something silly, and most of these bugs are not that hard to fix once we're aware of them.</p>
<h1><a href="https://jandemooij.nl/blog/cacheir/">CacheIR: A new approach to Inline Caching in Firefox</a></h1>
<p>Published 2017-01-25</p>
<p>The past months we have been <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1259927">working on CacheIR</a>, an overhaul of the IC code in SpiderMonkey's JITs. CacheIR has allowed us to remove thousands of lines of boilerplate and code duplication. It has also made it much easier to add new optimizations. This post describes how CacheIR works, why it's much better than what we had before, and some plans we have for the future.</p>
<h2 id="ic-stubs">IC stubs</h2>
<p>Both the Baseline JIT and IonMonkey optimizing JIT can use <a href="https://en.wikipedia.org/wiki/Inline_caching">Inline Cache</a> stubs to optimize things like property accesses. Each IC has a linked list of stubs. When we want to optimize a certain operation, like <code>object.foo</code>, the IC adds a new stub for this particular case to the list. Stubs usually have one or more guards and if a guard fails we jump to the next stub in the list.</p>
<h2 id="cacheir">CacheIR</h2>
<p>CacheIR is a very simple bytecode that makes it easy to generate new IC stubs to optimize particular cases. Consider the following JS code:</p>
<pre data-lang="js" style="background-color:#282c34;color:#abb2bf;" class="language-js "><code class="language-js" data-lang="js"><span style="color:#c678dd;">var </span><span style="color:#e06c75;">point </span><span>= {x: </span><span style="color:#d19a66;">1</span><span>, y: </span><span style="color:#d19a66;">2</span><span>};
</span><span style="color:#c678dd;">var </span><span style="color:#e06c75;">x </span><span>= </span><span style="color:#e06c75;">point</span><span>.x;
</span></code></pre>
<p>The GetProp IC we use for the <code>point.x</code> lookup uses the GetPropIRGenerator to emit CacheIR for this lookup. The IR we emit for it looks like this:</p>
<pre style="background-color:#282c34;color:#abb2bf;"><code><span>GuardIsObject Op0
</span><span>GuardShape Op0 Field0
</span><span>LoadFixedSlotResult Op0 Field1
</span></code></pre>
<p>As you can see, CacheIR is a very simple, linear IR. It has no explicit branches, no loops, just guards. This makes it easy to compile CacheIR to machine code.</p>
<p>Here Op0 is the IC's input operand (<code>point</code> in this case). We first guard that it's an object, then we guard on its Shape (an object's Shape determines its layout and properties; other engines may use the terms 'map' or 'hidden class'), and then we load a value from the object's fixed slot into the IC's output register.</p>
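<p>In JS terms (a sketch of the concept): objects created with the same properties in the same order share a Shape, so one GuardShape stub covers all of them, while an object with a different layout fails the guard and falls through to the next stub.</p>

```javascript
// point1 and point2 share a Shape ({x, y} added in the same order);
// point3 has an extra property, so it has a different Shape and would
// need a second stub at the same IC site.
const point1 = { x: 1, y: 2 };
const point2 = { x: 3, y: 4 };
const point3 = { x: 5, y: 6, z: 7 };

function getX(p) {
  return p.x; // one GetProp IC site, stubs attached per Shape
}

console.log(getX(point1) + getX(point2) + getX(point3)); // 9
```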
<p>In addition to the IR, the IR emitter also generates a list of <em>stub fields</em>. Field0 and Field1 above refer to fields in the stub data; in this case we have the following fields:</p>
<pre style="background-color:#282c34;color:#abb2bf;"><code><span>Field 0: Shape* pointer: 0xeb39f900
</span><span>Field 1: offset: 16
</span></code></pre>
<p>The IR itself does not contain any pointers or slot offsets; we will see below why this matters.</p>
<p>A GetProp IC has a single input value, but other ICs like GetElem (used for <code>x[y]</code> in JS) can have multiple inputs:</p>
<pre data-lang="js" style="background-color:#282c34;color:#abb2bf;" class="language-js "><code class="language-js" data-lang="js"><span style="color:#c678dd;">var </span><span style="color:#e06c75;">prop </span><span>= </span><span style="color:#98c379;">"x"</span><span>;
</span><span style="color:#c678dd;">var </span><span style="color:#e06c75;">x </span><span>= </span><span style="color:#e06c75;">point</span><span>[</span><span style="color:#e06c75;">prop</span><span>];
</span></code></pre>
<p>Now the IR generator <a href="http://searchfox.org/mozilla-central/rev/bf98cd4315b5efa1b28831001ad27d54df7bbb68/js/src/jit/CacheIR.cpp#1319-1321">will insert</a> extra guards to check <code>prop</code> is the string "x":</p>
<pre style="background-color:#282c34;color:#abb2bf;"><code><span>GuardIsObject Op0
</span><span>GuardIsString Op1 <---
</span><span>GuardSpecificAtom Op1 Field0 <---
</span><span>GuardShape Op0 Field1
</span><span>LoadFixedSlotResult Op0 Field2
</span></code></pre>
<p>We have two extra guards now (GuardIsString and GuardSpecificAtom) and an extra stub field that stores the <em>atom</em> we're guarding on ("x" in this case), but other than that the IR looks the same as before. The same trick can be used in a lot of other cases: for example, if <code>point</code> would be a proxy/wrapper to our Point object, all we need to do is emit some CacheIR instructions to unwrap it and then we can emit the same IR as before.</p>
<p>We emit the same CacheIR for our Baseline and Ion ICs (but Ion ICs may NOP some type guards, for instance if we know the input is always an object). This is a big improvement over what we had before: our Baseline and Ion ICs used to share some code but not much, because they're quite different. This meant there was a lot of code duplication and many cases we optimized in Ion but not (or differently) in Baseline. CacheIR fixes all this.</p>
<p>Instead of having register allocation, code generation, and high level decisions (which guards to emit) in the same code, we now have a very clear separation of concerns: when we emit the IR, we don't have to worry about registers, stack slots, or boxing formats. When we compile the IR to machine code, we can simply compile one instruction at a time and don't have to worry about what to emit next. This separation has made the code much more readable and maintainable.</p>
<h2 id="sharing-baseline-stub-code">Sharing Baseline stub code</h2>
<p>Our Baseline JIT can use the same machine code for many different stubs, because the IC inputs are always boxed Values stored in fixed registers, and the IC stub reads things like Shape* pointers and slot offsets from the <em>stub data</em> instead of having them baked into the code. Before CacheIR, each Baseline stub <a href="https://hg.mozilla.org/releases/mozilla-release/file/327e081221b0/js/src/jit/SharedIC.h#l2713">had to</a> define an int32 key for this purpose and stubs with the same key could share code. This worked well, but it was tedious to maintain and easy to screw up when making changes to IC code. With CacheIR, we simply use the IR as key: stubs with the same IR can share code. This means the code sharing works automatically and we no longer have to think about it. It eliminates a lot of boilerplate and a whole class of potential bugs.</p>
<p>Removing boilerplate has been a recurring theme; we have seen the same thing with GC tracing, for instance. Each Baseline stub that contained a Shape guard <a href="https://hg.mozilla.org/releases/mozilla-release/file/327e081221b0/js/src/jit/SharedIC.cpp#l454">had to</a> trace this Shape for GC purposes. Now there's a <a href="http://searchfox.org/mozilla-central/rev/02a56df6474a97cf84d94bbcfaa126979970905d/js/src/jit/CacheIRCompiler.cpp#799">single place</a> where we trace Shapes stored in stubs. Less boilerplate and much harder to get wrong.</p>
<h2 id="compiling-cacheir-to-machine-code">Compiling CacheIR to machine code</h2>
<p>After emitting the IR, we can compile it to native code. The code generation for <a href="http://searchfox.org/mozilla-central/rev/02a56df6474a97cf84d94bbcfaa126979970905d/js/src/jit/CacheIRCompiler.h#17-47">most CacheIR instructions</a> is shared between Baseline and Ion. As mentioned above, Baseline stub code is shared, so some CacheIR instructions are compiled differently for Baseline and Ion: Ion can bake in values directly into the code, whereas the Baseline code will read them from the stub data.</p>
<p>The CacheIR compiler uses a very simple register allocator to track where each operand lives (register, stack slot, etc) and which registers are available. When we compile an instruction, the first thing we do is <a href="http://searchfox.org/mozilla-central/rev/02a56df6474a97cf84d94bbcfaa126979970905d/js/src/jit/IonCacheIRCompiler.cpp#479-482">request registers</a> for the operands we need, and the register allocator takes care of spilling and restoring values as needed.</p>
<p>Furthermore, the register allocator allows us to generate <em>failure paths</em> automatically. A <a href="http://searchfox.org/mozilla-central/rev/02a56df6474a97cf84d94bbcfaa126979970905d/js/src/jit/IonCacheIRCompiler.cpp#499-503">failure path</a> is the code we emit when a guard fails: it restores the input registers and jumps to the next stub. This works because the register allocator knows the register state at the start of the stub and the current register state.</p>
<p>This makes it much easier to write IC code: you no longer have to remember where all values are stored, which registers are available, and which ones have to be restored. It eliminates another class of subtle bugs.</p>
<h2 id="improvements">Improvements</h2>
<p>Tom Schuster (evilpie) has been working on a logging mechanism to find cases our ICs currently fail to optimize. He already <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1328076">fixed</a> numerous <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1328077">performance</a> issues <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1329016">that</a> came up on websites like Facebook and Google Docs. We also had some performance issues on file that used to be hard to fix, but were <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=965992#c7">much easier</a> to optimize with CacheIR. In some cases, fixing an issue found on real-world websites turned out to (unexpectedly) <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1328076#c7">improve benchmarks</a> as well :)</p>
<p>Because the CacheIR we emit is the same for Baseline and Ion, we no longer have cases where Ion has IC support for something but Baseline doesn't. Whenever we add a new optimization to our IC code, it works in both Baseline and Ion. As a result, especially our Baseline JIT now optimizes a lot more things than before.</p>
<p>CacheIR stubs require much less boilerplate and code duplication. The IR instructions are our IC stub building blocks: many instructions can be reused for other things. Every time we converted a stub to CacheIR we saw <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1322091#c4">big</a> <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1324566#c25">improvements</a> in <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1320670#c0">code size</a> and maintainability. This has been incredibly satisfying.</p>
<h2 id="future-plans">Future plans</h2>
<p>We are still working on converting the remaining Baseline and Ion IC stubs to CacheIR, and we will continue to use our new CacheIR tools to speed up a lot of other operations.</p>
<p>Our Ion optimizing JIT is currently able to pattern match the CacheIR we emitted for Baseline and uses it to optimize certain cases. Once we are done converting IC stubs to CacheIR, the <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1324561">next step</a> is to add a generic CacheIR to MIR compiler. This will bring some significant performance wins and will let us remove even more code: whenever we optimize a new case in the CacheIR emitter, we will not only generate Baseline IC code and Ion IC code, but also Ion inline paths from it.</p>
<p>CacheIR will also make it much easier to optimize the ES2015 <code>super</code> property accesses used in derived classes. <code>super.x</code> involves 2 objects: the receiver and the super object. With CacheIR, we can simply add a new input operand to our GetProp IR generator and the IR compiler will do the right thing.</p>
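<p>The case in question looks like this: the lookup for <code>super.x</code> starts on the home object's prototype but uses the instance as receiver, so the IC genuinely has two object inputs.</p>

```javascript
// super.x finds the getter on Base.prototype (the home object's proto)
// but runs it with `this` bound to the Derived instance.
class Base {
  get x() {
    return this.val * 10;
  }
}
class Derived extends Base {
  constructor(val) {
    super();
    this.val = val;
  }
  get x() {
    return super.x + 1;
  }
}
console.log(new Derived(4).x); // 41
```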
<h2 id="conclusion">Conclusion</h2>
<p>We've seen how CacheIR is used in SpiderMonkey to simplify/improve IC code generation and remove code duplication and boilerplate. Firefox Nightly builds now optimize more cases than ever, and we are working hard to improve our IC coverage even more. We will also see some serious performance wins from compiling CacheIR to MIR.</p>
<h1><a href="https://jandemooij.nl/blog/wx-jit-code-enabled-in-firefox/">W^X JIT-code enabled in Firefox</a></h1>
<p>Published 2015-12-29</p>
<p>Back in June, I <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=977805">added</a> an option to SpiderMonkey to enable W^X protection of JIT code. The past weeks I've been <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1215479">working on</a> fixing the remaining performance issues and yesterday I enabled W^X on the Nightly channel, on all platforms. What this means is that each page holding JIT code is either executable <em>or</em> writable, never both at the same time.</p>
<h3 id="why">Why?</h3>
<p>Almost all JITs (including the ones in Firefox until now) allocate memory pages for code with RWX (read-write-execute) permissions. JITs typically need to <em>patch</em> code (for inline caches, for instance) and with writable memory they can do that with no performance overhead. RWX memory introduces some problems though:</p>
<ul>
<li><strong>Security</strong>: RWX pages make it easier to exploit certain bugs. As a result, all modern operating systems store code in executable but non-writable memory, and data is usually not executable, see <a href="https://en.wikipedia.org/wiki/W^X">W^X</a> and <a href="https://en.wikipedia.org/wiki/Data_Execution_Prevention">DEP</a>. RWX JIT-code is an exception to this rule and that makes it an interesting target.</li>
<li><strong>Memory corruption</strong>: I've seen some memory dumps for crashes in JIT-code that might have been caused by memory corruption elsewhere. All memory corruption bugs are serious, but <em>if</em> it happens for whatever reason, it's much better to crash immediately.</li>
</ul>
<h3 id="how-it-works">How It Works</h3>
<p>With W^X enabled, all JIT-code pages are non-writable by default. When we need to patch JIT-code for some reason, we use a <a href="https://en.wikipedia.org/wiki/Resource_Acquisition_Is_Initialization">RAII-class</a>, <code>AutoWritableJitCode</code>, to make the page(s) we're interested in writable (RW), using <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa366898%28v=vs.85%29.aspx">VirtualProtect</a> on Windows and <a href="http://man7.org/linux/man-pages/man2/mprotect.2.html">mprotect</a> on other platforms. The destructor then toggles this back from RW to RX when we're done with it.</p>
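<p><code>AutoWritableJitCode</code> is C++, but the RAII pattern it relies on can be sketched in JS with <code>try</code>/<code>finally</code> (everything below is a simulation with made-up names; the real class flips actual page protections):</p>

```javascript
// A sketch of the RAII idea behind AutoWritableJitCode: make the page
// writable, patch, then restore RX even if patching throws. The "page"
// here is a plain object and the protection flip is simulated.
function withWritable(page, patch) {
  page.writable = true; // RX -> RW (mprotect/VirtualProtect in real code)
  try {
    patch(page);
  } finally {
    page.writable = false; // RW -> RX, guaranteed on every exit path
  }
}

const page = { writable: false, code: [] };
withWritable(page, p => p.code.push("patched"));
console.log(page.writable); // false
console.log(page.code);     // [ 'patched' ]
```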
<p>(As an aside, an alternative to W^X is a dual-mapping scheme: pages are mapped twice, once as RW and once as RX. In 2010, some people <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=506693">wrote patches</a> to implement this for TraceMonkey, but this work never landed. This approach avoids the mprotect overhead, but for this to be safe, the RW mapping should be in a separate process. It's also more complicated and introduces IPC overhead.)</p>
<h3 id="performance">Performance</h3>
<p>Last week I <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1233818">fixed</a> implicit interrupt checks to work with W^X, <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1234246">got rid of</a> some unnecessary mprotect calls, and <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1235046">optimized</a> code poisoning to be faster with W^X.</p>
<p>After that, the performance overhead was pretty small on all benchmarks and websites I tested: Kraken and Octane are less than 1% slower with W^X enabled. On (ancient) SunSpider the overhead is bigger, because most tests finish in a few milliseconds, so any compile-time overhead is measurable. Still, it's less than 3% on Windows and Linux. On OS X it's less than 4% because mprotect is slower there.</p>
<p>I think W^X works well in SpiderMonkey for a number of reasons:</p>
<ul>
<li>We run bytecode in the interpreter before Baseline-compiling it. On the web, most functions have less than ~10 calls or loop iterations, so we never JIT those and we don't have any memory protection overhead.</li>
<li>The Baseline JIT uses <a href="https://en.wikipedia.org/wiki/Inline_caching">IC stubs</a> for most operations, but we use indirect calls here, so we don't have to make code writable when attaching stubs. Baseline stubs also share code, so only the first time we attach a particular stub we compile code for it. Ion IC stubs do require us to make memory writable, but Ion doesn't use ICs as much as Baseline.</li>
<li>For asm.js (and soon <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1188259">WebAssembly</a>!), we do <a href="https://en.wikipedia.org/wiki/Ahead-of-time_compilation">AOT-compilation</a> of the whole module. After compilation, we need only one mprotect call to switch everything from RW to RX. Furthermore, this code is only modified on some slow paths, so there's basically no performance overhead for asm.js/WebAssembly code.</li>
</ul>
<h3 id="conclusion">Conclusion</h3>
<p>I've enabled W^X protection for all JIT-code in Firefox Nightly. Assuming we don't run into bugs or serious performance issues, this will ship in Firefox 46.</p>
<p>Last but not least, thanks to the OpenBSD and HardenedBSD teams for being <a href="http://undeadly.org/cgi?action=article&sid=20151021191401">brave</a> enough to <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1215479#c12">flip</a> the W^X switch <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1215479#c7">before</a> we did!</p>
<h1><a href="https://jandemooij.nl/blog/testing-math-random-crushing-the-browser/">Testing Math.random(): Crushing the browser</a></h1>
<p>Published 2015-11-30</p>
<p>(For <strong>tl;dr</strong>, see the Conclusion.)</p>
<p>A few days ago, <a href="/blog/2015/11/27/math-random-and-32-bit-precision/">I wrote about</a> Math.random() implementations in Safari and (older versions of) Chrome using only 32 bits of precision. As I mentioned in that blog post, I've been <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=322529">working on</a> upgrading Math.random() in SpiderMonkey to XorShift128+. V8 has been using the same algorithm since last week. (Update Dec 1: WebKit is now <a href="https://bugs.webkit.org/show_bug.cgi?id=151641">also using</a> XorShift128+!)</p>
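<p>For reference, the core of XorShift128+ is tiny; here is a sketch in JS using BigInt to stand in for 64-bit integers (the engines implement this in C++, and the 23/17/26 shift constants come from Vigna's xorshift128+ paper; this is an illustration, not a drop-in Math.random replacement):</p>

```javascript
// Minimal XorShift128+ sketch. State is two 64-bit words; each step
// xorshifts one word, and the sum of the two words is the raw output.
// The high 53 bits are turned into a double in [0, 1).
const M64 = (1n << 64n) - 1n;

function makeXorShift128Plus(seed0, seed1) {
  let s0 = BigInt(seed0) & M64;
  let s1 = BigInt(seed1) & M64;
  return function next() {
    let x = s0;
    const y = s1;
    s0 = y;
    x = (x ^ ((x << 23n) & M64)) & M64;
    s1 = x ^ y ^ (x >> 17n) ^ (y >> 26n);
    return Number(((s1 + y) & M64) >> 11n) / 2 ** 53;
  };
}

const random = makeXorShift128Plus(0x9e3779b9n, 0x7f4a7c15n);
const v = random();
console.log(v >= 0 && v < 1); // true
```

Whether the output is statistically good is exactly what the TestU01 batteries below measure.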
<p>The most extensive RNG test is <a href="http://simul.iro.umontreal.ca/testu01/tu01.html">TestU01</a>. It's a bit of a pain to run: to test a custom RNG, you have to compile the library and then link it to a test program. I did this initially for the SpiderMonkey shell but after that I thought it'd be more interesting to use <a href="http://emscripten.org/">Emscripten</a> to compile TestU01 to asm.js so we can easily run it in different browsers.</p>
<p>Today I <a href="https://github.com/jandem/TestU01.js">tried this</a> and even though I had never used Emscripten before, I had it running in the browser in less than an hour. Because the tests can take a long time, it runs in a web worker. You can <a href="https://jandem.github.io/TestU01.js/test.htm">try it for yourself here</a>.</p>
<p>I also wanted to test <a href="https://developer.mozilla.org/en/docs/Web/API/window.crypto.getRandomValues">window.crypto.getRandomValues()</a> but unfortunately it's not available <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=842818">in workers</a>.</p>
<p>Disclaimer: browsers implement Math functions like Math.sin differently and this can affect their precision. I don't know if TestU01 uses these functions and whether it affects the results below, but it's possible. Furthermore, some test failures are intermittent so results can vary between runs.</p>
<h2 id="results">Results</h2>
<p>TestU01 has three <em>batteries</em> of tests: SmallCrush, Crush, and BigCrush. SmallCrush runs only a few tests and is very fast. Crush and especially BigCrush have a lot more tests so they are much slower.</p>
<h3 id="smallcrush">SmallCrush</h3>
<p>Running SmallCrush takes about 15-30 seconds. It runs 10 tests with 15 statistics (results). Here is the number of failures I got in each browser:</p>
<table><thead><tr><th>Browser</th><th>Number of failures</th></tr></thead><tbody>
<tr><td>Firefox Nightly</td><td>1: BirthdaySpacings</td></tr>
<tr><td>Firefox with XorShift128+</td><td>0</td></tr>
<tr><td>Chrome 48</td><td>11</td></tr>
<tr><td>Safari 9</td><td>1: RandomWalk1 H</td></tr>
<tr><td>Internet Explorer 11</td><td>1: BirthdaySpacings</td></tr>
<tr><td>Edge 20</td><td>1: BirthdaySpacings</td></tr>
</tbody></table>
<p>Chrome/V8 failing 11 out of 15 is <a href="https://medium.com/@betable/tifu-by-using-math-random-f1c308c4fd9d">not too surprising</a>. Again, the V8 team fixed this last week and the new RNG should pass SmallCrush.</p>
<h3 id="crush">Crush</h3>
<p>The Crush battery of tests is much more time consuming. On my MacBook Pro, it finishes in less than an hour in Firefox but in Chrome and Safari it can take at least 2 hours. It runs 96 tests with 144 statistics. Here are the results I got:</p>
<table><thead><tr><th>Browser</th><th>Number of failures</th></tr></thead><tbody>
<tr><td>Firefox Nightly</td><td>12</td></tr>
<tr><td>Firefox with XorShift128+</td><td>0</td></tr>
<tr><td>Chrome 48</td><td>108</td></tr>
<tr><td>Safari 9</td><td>33</td></tr>
<tr><td>Internet Explorer 11</td><td>14</td></tr>
</tbody></table>
<p>XorShift128+ passes Crush, as expected. V8's previous RNG fails most of these tests and Safari/WebKit isn't doing too great either.</p>
<h3 id="bigcrush">BigCrush</h3>
<p>BigCrush didn't finish in the browser because it requires more than 512 MB of memory. To fix that I probably need to recompile the asm.js code with a different <code>TOTAL_MEMORY</code> value or with <code>ALLOW_MEMORY_GROWTH=1</code>.</p>
<p>Furthermore, running BigCrush would likely take at least 3 hours in Firefox and more than 6-8 hours in Safari, Chrome, and IE, so I didn't bother.</p>
<p>The XorShift128+ algorithm being implemented in Firefox and Chrome should pass BigCrush (for Firefox, I <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=322529#c101">verified this</a> in the SpiderMonkey shell).</p>
<h2 id="about-ie-and-edge">About IE and Edge</h2>
<p>I noticed Firefox (without XorShift128+) and Internet Explorer 11 get <strong>very</strong> similar test failures. When running SmallCrush, they both fail the same BirthdaySpacings test. Here's the list of Crush failures they have in common:</p>
<ul>
<li>11 BirthdaySpacings, t = 2</li>
<li>12 BirthdaySpacings, t = 3</li>
<li>13 BirthdaySpacings, t = 4</li>
<li>14 BirthdaySpacings, t = 7</li>
<li>15 BirthdaySpacings, t = 7</li>
<li>16 BirthdaySpacings, t = 8</li>
<li>17 BirthdaySpacings, t = 8</li>
<li>19 ClosePairs mNP2S, t = 3</li>
<li>20 ClosePairs mNP2S, t = 7</li>
<li>38 Permutation, r = 15</li>
<li>40 CollisionPermut, r = 15</li>
<li>54 WeightDistrib, r = 24</li>
<li>75 Fourier3, r = 20</li>
</ul>
<p>This suggests the RNG in IE may be very similar to the one we used in Firefox (imported from Java decades ago). Maybe Microsoft imported the same algorithm from somewhere? If anyone on the Chakra team is reading this and can tell us more, it would be much appreciated :)</p>
<p>IE 11 fails 2 more tests that pass in Firefox. Some failures are intermittent and I'd have to rerun the tests to see if these failures are systematic.</p>
<p>Based on the SmallCrush results I got with Edge 20, I think it uses the same algorithm as IE 11 (not too surprising). Unfortunately, the Windows VM I downloaded to test Edge shut down for some reason while it was running Crush, so I gave up and don't have full results for it.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I used Emscripten to port TestU01 <a href="https://jandem.github.io/TestU01.js/test.htm">to the browser</a>. Results confirm most browsers currently don't use very strong RNGs for Math.random(). Both Firefox and Chrome are implementing XorShift128+, which has no systematic failures on any of these tests.</p>
<p>Furthermore, these results indicate IE and Edge <strong>may</strong> use the same algorithm as the one we used in Firefox.</p>
Math.random() and 32-bit precision2015-11-27T00:00:00+00:002015-11-27T00:00:00+00:00https://jandemooij.nl/blog/math-random-and-32-bit-precision/<p>Last week, Mike Malone, CTO of <a href="https://betable.com">Betable</a>, wrote a very insightful and informative <a href="https://medium.com/@betable/tifu-by-using-math-random-f1c308c4fd9d">article</a> on Math.random() and PRNGs in general. Mike pointed out V8/Chrome used a pretty bad algorithm to generate random numbers and, since this week, V8 uses a better algorithm.</p>
<p>The article also mentioned the RNG we use in Firefox (it was copied from Java a long time ago) should be improved as well. I fully agree with this. In fact, over the past few days I've been <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=322529">working on</a> upgrading Math.random() in SpiderMonkey to XorShift128+, see <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=322529">bug 322529</a>. We think XorShift128+ is a good choice: we already had a copy of the RNG in our repository, it's fast (even faster than our current algorithm!), and it passes BigCrush (the most complete RNG test available).</p>
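<p>To give an idea of how small the generator is, here's a sketch of XorShift128+ in JavaScript, using BigInt for the 64-bit wrapping arithmetic the algorithm needs. The seed values below are arbitrary (any state other than all zeroes works), and the shift constants (23, 17, 26) are one published variant; treat this as an illustration rather than SpiderMonkey's exact code:</p>

```javascript
const MASK64 = (1n << 64n) - 1n; // keep intermediate values in 64 bits

// Generator state: must never be all zeroes.
let s0 = 1n;
let s1 = 2n;

function xorshift128plus() {
  let x = s0;
  const y = s1;
  s0 = y;
  x ^= (x << 23n) & MASK64;
  x ^= x >> 17n;
  x ^= y ^ (y >> 26n);
  s1 = x;
  return (s1 + y) & MASK64; // 64-bit wrapping add
}
```

<p>Despite producing 64-bit outputs with a 128-bit state, this is just a handful of shifts, xors, and one add per call, which is why it can be faster than the old Java-derived algorithm.</p>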
<p>While working on this, I looked at a number of different RNGs and noticed Safari/WebKit <a href="https://github.com/WebKit/webkit/blob/67985c34ffc405f69995e8a35f9c38618625c403/Source/WTF/wtf/WeakRandom.h#L104">uses GameRand</a>. It's extremely fast but <strong>very</strong> weak. (Update Dec 1: WebKit is now <a href="https://bugs.webkit.org/show_bug.cgi?id=151641">also using</a> XorShift128+, so this doesn't apply to newer Safari/WebKit versions.)</p>
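<p>For reference, the GameRand-style generator in the linked <code>WeakRandom.h</code> looks roughly like this sketch (reconstructed from memory, so treat the details as approximate; <code>>>> 0</code> emulates 32-bit unsigned wrapping in JavaScript, and the seeds are arbitrary):</p>

```javascript
// Two 32-bit state words; a real implementation seeds these from entropy.
let high = 0x12345678 >>> 0;
let low = 0x9abcdef0 >>> 0;

function gameRand() {
  // Rotate the high word by 16 bits, then mix in the low word.
  high = ((((high << 16) >>> 0) + (high >>> 16)) + low) >>> 0;
  low = (low + high) >>> 0;
  return high;
}
```

<p>A 16-bit rotate and two adds per call: it's easy to see both why it's so fast and why it fails statistical tests.</p>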
<p>Most interesting to me, though, was that, like the previous V8 RNG, it has only 32 bits of precision: it generates a 32-bit unsigned integer and then divides that by <code>UINT_MAX + 1</code>. This means the result of the RNG is always one of about 4.2 billion different numbers, instead of about 9 quadrillion (2^53). In other words, it can generate only 0.00005% of the numbers an ideal RNG can generate.</p>
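<p>The arithmetic behind that percentage is straightforward; a quick sketch:</p>

```javascript
// A 32-bit generator divided by 2^32 can produce at most 2^32 distinct
// doubles, while an ideal Math.random() draws from 2^53 evenly spaced
// values in [0, 1).
const values32 = 2 ** 32;
const values53 = 2 ** 53;
console.log((100 * values32 / values53).toFixed(5) + "%"); // "0.00005%"
```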
<p>I wrote a <a href="/test/random-precision.htm">small testcase</a> to visualize this. It generates random numbers and plots all numbers smaller than 0.00000131072.</p>
<p>Here's the output I got in Firefox (old algorithm) after generating 115 billion numbers:</p>
<p><img src="/img/rand-firefox-old.png" alt="" /></p>
<p>And a Firefox build with XorShift128+:</p>
<p><img src="/img/rand-firefox-new.png" alt="" /></p>
<p>In Chrome (before Math.random was fixed):</p>
<p><img src="/img/rand-chrome.png" alt="" /></p>
<p>And in Safari:</p>
<p><img src="/img/rand-safari.png" alt="" /></p>
<p>These images clearly show the difference in precision.</p>
<h3 id="conclusion">Conclusion</h3>
<p>Safari and older Chrome versions both generate random numbers with only 32 bits of precision. This issue has been fixed in Chrome, but Safari's RNG should probably be fixed as well. Even if we ignore its suboptimal precision, the algorithm is still extremely weak.</p>
<p>Math.random() is not a <a href="https://en.wikipedia.org/wiki/Cryptographically_secure_pseudorandom_number_generator">cryptographically-secure PRNG</a> and should never be used for anything security-related, but, as Mike argued, there are a lot of much better (and still very fast) RNGs to choose from.</p>
Making `this` a real binding in SpiderMonkey2015-11-25T00:00:00+00:002015-11-25T00:00:00+00:00https://jandemooij.nl/blog/making-this-a-real-binding-in-spidermonkey/<p>Last week I landed <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1132183">bug 1132183</a>, a pretty large patch rewriting the implementation of <code>this</code> in SpiderMonkey.</p>
<h3 id="how-this-works-in-js">How <em>this</em> Works In JS</h3>
<p>In JS, when a function is called, an implicit <code>this</code> argument is passed to it. In strict mode, <code>this</code> inside the function just returns that value:</p>
<pre data-lang="js" style="background-color:#282c34;color:#abb2bf;" class="language-js "><code class="language-js" data-lang="js"><span style="color:#c678dd;">function </span><span style="color:#61afef;">f</span><span>() { </span><span style="color:#98c379;">"use strict"</span><span>; </span><span style="color:#c678dd;">return </span><span style="color:#e06c75;">this</span><span>; }
</span><span style="color:#e06c75;">f</span><span>.</span><span style="color:#56b6c2;">call</span><span>(</span><span style="color:#d19a66;">123</span><span>); </span><span style="font-style:italic;color:#5c6370;">// 123
</span></code></pre>
<p>In non-strict functions, <code>this</code> always returns an object. If the this-argument is a primitive value, it's <em>boxed</em> (converted to an object):</p>
<pre data-lang="js" style="background-color:#282c34;color:#abb2bf;" class="language-js "><code class="language-js" data-lang="js"><span style="color:#c678dd;">function </span><span style="color:#61afef;">f</span><span>() { </span><span style="color:#c678dd;">return </span><span style="color:#e06c75;">this</span><span>; }
</span><span style="color:#e06c75;">f</span><span>.</span><span style="color:#56b6c2;">call</span><span>(</span><span style="color:#d19a66;">123</span><span>); </span><span style="font-style:italic;color:#5c6370;">// returns an object: new Number(123)
</span></code></pre>
<p>Arrow functions don't have their own <code>this</code>. They <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Functions/Arrow_functions#Lexical_this">inherit</a> the <code>this</code> value from their enclosing scope:</p>
<pre data-lang="js" style="background-color:#282c34;color:#abb2bf;" class="language-js "><code class="language-js" data-lang="js"><span style="color:#c678dd;">function </span><span style="color:#61afef;">f</span><span>() {
</span><span> </span><span style="color:#98c379;">"use strict"</span><span>;
</span><span> () </span><span style="color:#c678dd;">=> </span><span style="color:#e06c75;">this</span><span>; </span><span style="font-style:italic;color:#5c6370;">// `this` is 123
</span><span>}
</span><span style="color:#e06c75;">f</span><span>.</span><span style="color:#56b6c2;">call</span><span>(</span><span style="color:#d19a66;">123</span><span>);
</span></code></pre>
<p>And, of course, <code>this</code> can be used inside <code>eval</code>:</p>
<pre data-lang="js" style="background-color:#282c34;color:#abb2bf;" class="language-js "><code class="language-js" data-lang="js"><span style="color:#c678dd;">function </span><span style="color:#61afef;">f</span><span>() {
</span><span> </span><span style="color:#98c379;">"use strict"</span><span>;
</span><span> </span><span style="color:#56b6c2;">eval</span><span>(</span><span style="color:#98c379;">"this"</span><span>); </span><span style="font-style:italic;color:#5c6370;">// 123
</span><span>}
</span><span style="color:#e06c75;">f</span><span>.</span><span style="color:#56b6c2;">call</span><span>(</span><span style="color:#d19a66;">123</span><span>);
</span></code></pre>
<p>Finally, <code>this</code> can also be used in top-level code. In that case it's <em>usually</em> the global object (lots of hand waving here).</p>
<h3 id="how-this-was-implemented">How <em>this</em> Was Implemented</h3>
<p>Until last week, here's how this worked in SpiderMonkey:</p>
<ul>
<li>Every stack frame had a this-argument,</li>
<li>Each <code>this</code> expression in JS code resulted in a single bytecode op (JSOP_THIS),</li>
<li>This bytecode op <em>boxed</em> the frame's this-argument if needed and then returned the result.</li>
</ul>
<p>Special case: to support the <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Functions/Arrow_functions#Lexical_this">lexical this</a> behavior of arrow functions, we emitted JSOP_THIS when we defined (cloned) the arrow function and then copied the result to a slot on the function. Inside the arrow function, JSOP_THIS would then <a href="/blog/2014/04/11/fast-arrow-functions-in-firefox-31/">load the value from that slot</a>.</p>
<p>There was some more complexity around <code>eval</code>: eval-frames also had their own this-slot, so whenever we did a direct <code>eval</code> we'd ensure the outer frame had a boxed (if needed) this-value and then we'd copy it to the eval frame.</p>
<h3 id="the-problem">The Problem</h3>
<p>The most serious problem was that this approach is fundamentally incompatible with ES6 <a href="https://hacks.mozilla.org/2015/08/es6-in-depth-subclassing/"><em>derived class constructors</em></a>, as they initialize their <code>this</code> value dynamically when they call <code>super()</code>. Nested arrow functions (and eval) then have to 'see' the initialized <code>this</code> value, but that was impossible to support because arrow functions and eval frames used <em>their own</em> (copied) this value, instead of the updated one.</p>
<p>Here's a worst-case example:</p>
<pre data-lang="js" style="background-color:#282c34;color:#abb2bf;" class="language-js "><code class="language-js" data-lang="js"><span style="color:#c678dd;">class </span><span style="color:#e5c07b;">Derived </span><span style="color:#c678dd;">extends </span><span style="color:#98c379;">Base </span><span>{
</span><span> </span><span style="color:#c678dd;">constructor</span><span>() {
</span><span> </span><span style="color:#c678dd;">var </span><span style="color:#61afef;">arrow </span><span>= () </span><span style="color:#c678dd;">=> </span><span style="color:#e06c75;">this</span><span>;
</span><span>
</span><span> </span><span style="font-style:italic;color:#5c6370;">// Runtime error: `this` is not initialized inside `arrow`.
</span><span> </span><span style="color:#61afef;">arrow</span><span>();
</span><span>
</span><span> </span><span style="font-style:italic;color:#5c6370;">// Call Base constructor, initialize our `this` value.
</span><span> </span><span style="color:#56b6c2;">eval</span><span>(</span><span style="color:#98c379;">"super()"</span><span>);
</span><span>
</span><span> </span><span style="font-style:italic;color:#5c6370;">// The arrow function now returns the initialized `this`.
</span><span> </span><span style="color:#61afef;">arrow</span><span>();
</span><span> }
</span><span>}
</span></code></pre>
<p>We currently (temporarily!) throw an exception when arrow functions or eval are used in derived class constructors in Firefox Nightly.</p>
<p>Boxing <code>this</code> lazily also added extra complexity and overhead. I already mentioned how we had to <em>compute</em> <code>this</code> whenever we used <code>eval</code>.</p>
<h3 id="the-solution">The Solution</h3>
<p>To fix these issues, I made <code>this</code> a real binding:</p>
<ul>
<li>Non-arrow functions that use <code>this</code> or <code>eval</code> define a special <code>.this</code> variable,</li>
<li>In the function prologue, we get the this-argument, box it if needed (with a new op, JSOP_FUNCTIONTHIS) and store it in <code>.this</code>,</li>
<li>Then we simply use that variable each time <code>this</code> is used.</li>
</ul>
<p>Arrow functions and eval frames no longer have their own this-slot, they just reference the <code>.this</code> variable of the outer function. For instance, consider the function below:</p>
<pre data-lang="js" style="background-color:#282c34;color:#abb2bf;" class="language-js "><code class="language-js" data-lang="js"><span style="color:#c678dd;">function </span><span style="color:#61afef;">f</span><span>() {
</span><span> </span><span style="color:#c678dd;">return </span><span>() </span><span style="color:#c678dd;">=> </span><span style="color:#e06c75;">this</span><span>.</span><span style="color:#61afef;">foo</span><span>();
</span><span>}
</span></code></pre>
<p>We generate bytecode similar to the following pseudo-JS:</p>
<pre data-lang="js" style="background-color:#282c34;color:#abb2bf;" class="language-js "><code class="language-js" data-lang="js"><span style="color:#c678dd;">function </span><span style="color:#61afef;">f</span><span>() {
</span><span> </span><span style="color:#c678dd;">var</span><span> .</span><span style="color:#e06c75;">this </span><span>= </span><span style="color:#61afef;">BoxThisIfNeeded</span><span>(</span><span style="color:#e06c75;">this</span><span>);
</span><span> </span><span style="color:#c678dd;">return </span><span>() </span><span style="color:#c678dd;">=> </span><span>(.</span><span style="color:#e06c75;">this</span><span>).</span><span style="color:#61afef;">foo</span><span>();
</span><span>}
</span></code></pre>
<p>I decided to call this variable <code>.this</code>, because it nicely matches the other magic 'dot-variable' we already had, <code>.generator</code>. Note that these are not valid variable names, so JS code can't access them. I only had to make sure with-statements don't intercept the <code>.this</code> lookup when <code>this</code> is used inside a with-statement.</p>
<p>Doing it this way has a number of benefits: we only have to check for primitive <code>this</code> values at the start of the function, instead of each time <code>this</code> is accessed (although in most cases our optimizing JIT could/can eliminate these checks, when it knows the this-argument must be an object). Furthermore, we no longer have to do anything special for arrow functions or eval; they simply access a 'variable' in the enclosing scope and the engine already knows how to do that.</p>
<p>In the global scope (and in eval or arrow functions in the global scope), we don't use a binding for <code>this</code> (I tried this initially but it turned out to be pretty complicated). There we emit JSOP_GLOBALTHIS for each this-expression, then that op gets the <code>this</code> value from a reserved slot on the lexical scope. This global <code>this</code> value never changes, so the JITs can get it from the global lexical scope at compile time and bake it in as a constant :) (Well.. in most cases. The embedding can run scripts with a non-syntactic scope chain, in that case we have to do a scope walk to find the nearest lexical scope. This should be uncommon and can be optimized/cached if needed.)</p>
<h3 id="the-debugger">The Debugger</h3>
<p>The main nuisance was fixing the debugger: because we only give (non-arrow) functions that use <code>this</code> or <code>eval</code> their own this-binding, what do we do when the debugger wants to know the this-value of a frame <em>without</em> a this-binding?</p>
<p>Fortunately, the debugger (DebugScopeProxy, actually) already knew how to solve a similar problem that came up with <code>arguments</code> (functions that don't use <code>arguments</code> don't get an arguments-object, but the debugger can request one anyway), so I was able to cargo-cult and do something similar for <code>this</code>.</p>
<h3 id="other-changes">Other Changes</h3>
<p>Some other changes I made in this area:</p>
<ul>
<li>In <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1125423">bug 1125423</a> I got rid of the innerObject/outerObject/thisValue Class hooks (also known as <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=604516">the holy grail</a>). Some scope objects had a (potentially effectful) thisValue hook to override their <code>this</code> behavior, this made it hard to see what was going on. Getting rid of that made it much easier to understand and rewrite the code.</li>
<li>I posted patches in <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1227263">bug 1227263</a> to remove the <code>this</code> slot from generator objects, eval frames and global frames.</li>
<li>IonMonkey was unable to compile top-level scripts that used <code>this</code>. As I mentioned above, compiling the new JSOP_GLOBALTHIS op is pretty simple in most cases; I wrote a small patch to fix this (<a href="https://bugzilla.mozilla.org/show_bug.cgi?id=922406">bug 922406</a>).</li>
</ul>
<h3 id="conclusion">Conclusion</h3>
<p>We changed the implementation of <code>this</code> in Firefox 45. The difference is (hopefully!) not observable, so these changes should not break anything or affect code directly. They do, however, pave the way for more performance work and fully compliant <a href="http://www.2ality.com/2015/02/es6-classes-final.html">ES6 Classes</a>! :)</p>
Using Rust to generate Mercurial short-hash collisions2015-05-05T00:00:00+00:002015-05-05T00:00:00+00:00https://jandemooij.nl/blog/using-rust-to-generate-mercurial-short-hash-collisions/<p>At Mozilla, we use <a href="https://www.mercurial-scm.org/">Mercurial</a> for the main Firefox repository. Mercurial, like Git, uses <a href="http://en.wikipedia.org/wiki/SHA-1">SHA1</a> hashes to identify a commit.</p>
<h3 id="short-hashes">Short hashes</h3>
<p>SHA1 hashes are fairly long, a string of 40 hex characters (160 bits), so Mercurial and Git allow using a prefix of that, as long as the prefix is unambiguous. Mercurial also typically only shows the first 12 characters (let’s call them <em>short hashes</em>), for instance:</p>
<pre data-lang="bash" style="background-color:#282c34;color:#abb2bf;" class="language-bash "><code class="language-bash" data-lang="bash"><span style="color:#e06c75;">$</span><span> hg id
</span><span style="color:#e06c75;">34828fed1639
</span><span style="color:#e06c75;">$</span><span> hg log</span><span style="color:#e06c75;"> -r</span><span> tip
</span><span style="color:#e06c75;">changeset:</span><span> 242221:312707328997
</span><span style="color:#e06c75;">tag:</span><span> tip
</span><span style="color:#e06c75;">...
</span></code></pre>
<p>And those are the hashes most Mercurial users use, for instance they are posted <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1031529#c16">in Bugzilla</a> whenever we land a patch etc.</p>
<p>Collisions with short hashes are much more likely than full SHA1 collisions, because the short hashes are only 48 bits long. As the <a href="https://www.mercurial-scm.org/wiki/FAQ#FAQ.2FTechnicalDetails.What_about_hash_collisions.3F_What_about_weaknesses_in_SHA1.3F">Mercurial FAQ</a> states, such collisions don’t really matter, because Mercurial will check if the hash is unambiguous and if it’s not it will require more than 12 characters.</p>
<p>So, short hash collisions are not the end of the world, but they are inconvenient because the standard 12-chars hg commit ids will become ambiguous and unusable. Fortunately, the <a href="https://hg.mozilla.org/mozilla-central/">mozilla-central repository</a> at this point does not contain any short hash collisions (it has about 242,000 commits).</p>
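<p>To put some numbers on that, here's a back-of-the-envelope sketch (the commit count comes from the repository size mentioned above; the rest is the standard birthday bound):</p>

```javascript
const SPACE = 2 ** 48;   // possible 12-hex-char (48-bit) prefixes
const COMMITS = 242000;  // roughly the size of mozilla-central

// Probability that a repository of this size already contains at least
// one accidental short-hash collision (birthday approximation):
const pCollision = 1 - Math.exp(-COMMITS * (COMMITS - 1) / (2 * SPACE));

// Expected number of brute-force attempts needed to match *any*
// existing commit's 12-char prefix:
const expectedTries = SPACE / COMMITS;

console.log(pCollision);    // ~1.04e-4, i.e. about a 0.01% chance
console.log(expectedTries); // ~1.16e9 attempts
```

<p>So it's unsurprising that mozilla-central has no accidental collisions yet, while a deliberate brute force (around a billion hashes) is well within reach of a laptop.</p>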
<h3 id="finding-short-hash-collisions">Finding short-hash collisions</h3>
<p>I’ve wondered for a while, can we create a commit that has the same short hash as another commit in the repository?</p>
<p>A brute force attack that works by committing and then reverting changes to the repository should work, but it’d be super slow. I haven’t tried it, but it’d probably take years to find a collision. Fortunately, there’s a much faster way to brute force this. Mercurial computes the commit id/hash <a href="https://www.mercurial-scm.org/wiki/Nodeid">like this</a>:</p>
<pre data-lang="python" style="background-color:#282c34;color:#abb2bf;" class="language-python "><code class="language-python" data-lang="python"><span style="color:#56b6c2;">hash </span><span>= </span><span style="color:#e06c75;">sha1</span><span>(</span><span style="color:#56b6c2;">min</span><span>(p1, p2) + </span><span style="color:#56b6c2;">max</span><span>(p1, p2) + contents)
</span></code></pre>
<p>Here p1 and p2 are the hashes of the parent commits, or a null hash (all zeroes) if there’s only one parent. To see what <em>contents</em> is, we can use the <em>hg debugdata</em> command:</p>
<pre data-lang="bash" style="background-color:#282c34;color:#abb2bf;" class="language-bash "><code class="language-bash" data-lang="bash"><span style="color:#e06c75;">$</span><span> hg debugdata</span><span style="color:#e06c75;"> -c</span><span> 34828fed1639
</span><span style="color:#e06c75;">40c6a58ef0be7591e6b0d48b36a8e1f88486b0ee
</span><span style="color:#e06c75;">Carsten </span><span style="color:#98c379;">"Tomcat"</span><span> Book <cbook@mozilla.com>
</span><span style="color:#e06c75;">1430739274 -7200
</span><span style="color:#e06c75;">extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/chromium_en_US.dic_delta
</span><span style="color:#e06c75;">...list</span><span> of changed files...
</span><span>
</span><span style="color:#e06c75;">merge</span><span> mozilla-inbound to mozilla-central a=merge
</span></code></pre>
<p>Perfect! This contains the commit message, so all we have to do is append some random data to the commit message, compute the (short) hash, check if there’s a collision and repeat until we find a match.</p>
<p>I wrote a small <a href="https://github.com/jandem/hgcollision">Rust program</a> to brute-force this. You can use it like this (I used the popular <a href="https://www.mercurial-scm.org/wiki/MqExtension">mq extension</a>, there are other ways to do it):</p>
<pre data-lang="bash" style="background-color:#282c34;color:#abb2bf;" class="language-bash "><code class="language-bash" data-lang="bash"><span style="color:#e06c75;">$</span><span> cd mozilla-central
</span><span style="color:#e06c75;">$</span><span> echo </span><span style="color:#98c379;">"Foo" </span><span>>> CLOBBER </span><span style="font-style:italic;color:#5c6370;"># make a random change
</span><span>$ hg qnew patch</span><span style="color:#e06c75;"> -m </span><span style="color:#98c379;">"Some message"
</span><span style="color:#e06c75;">$</span><span> hgcollision
</span><span style="color:#e06c75;">...snip...
</span><span style="color:#e06c75;">Got</span><span> 242223 prefixes
</span><span style="color:#e06c75;">Generated</span><span> random prefix: 1631965792_
</span><span style="color:#e06c75;">Tried</span><span> 242483200 hashes
</span><span style="color:#e06c75;">Found</span><span> collision! Prefix: b991f0726738, hash: b991f072673876a64c7a36f920b2ad2885a84fac
</span><span style="color:#e06c75;">Add</span><span> this to the end of your commit message: 1631965792_24262171
</span></code></pre>
<p>After about 2 minutes it’s done and tells us we have to append “1631965792_24262171” to our commit message to get a collision! Let’s try it (we have to be careful to preserve the original date/time, or we’ll get a different hash):</p>
<pre data-lang="bash" style="background-color:#282c34;color:#abb2bf;" class="language-bash "><code class="language-bash" data-lang="bash"><span style="color:#e06c75;">$</span><span> hg log</span><span style="color:#e06c75;"> -r</span><span> tip</span><span style="color:#e06c75;"> --template </span><span style="color:#98c379;">"{date|isodatesec}"
</span><span style="color:#e06c75;">2015-05-05</span><span> 20:21:59 +0200
</span><span style="color:#e06c75;">$</span><span> hg qref</span><span style="color:#e06c75;"> -m </span><span style="color:#98c379;">"Some message1631965792_24262171"</span><span style="color:#e06c75;"> -d </span><span style="color:#98c379;">"2015-05-05 20:21:59 +0200"
</span><span style="color:#e06c75;">$</span><span> hg id
</span><span style="color:#e06c75;">b991f0726738</span><span> patch/qbase/qtip/tip
</span><span style="color:#e06c75;">$</span><span> hg log</span><span style="color:#e06c75;"> -r</span><span> b991f0726738
</span><span style="color:#e06c75;">abort:</span><span> 00changelog.i@b991f0726738: ambiguous identifier!
</span></code></pre>
<p>Voilà! We successfully created a Mercurial short hash collision!</p>
<p>And no, I didn’t use this on any patches I pushed to mozilla-central..</p>
<h3 id="rust">Rust</h3>
<p>The Rust source code is available <a href="https://github.com/jandem/hgcollision">here</a>. It was my first, quick-and-dirty Rust program but writing it was a nice way to get more familiar with the language. I used the rust-crypto crate to calculate SHA1 hashes, installing and using it was much easier than I expected. Pretty nice experience.</p>
<p>The program can check about 100 million hashes in one minute on my laptop. It usually takes about 1-5 minutes to find a collision, this also depends on the size of the repository (mozilla-central has about 242,000 commits). It’d be easy to use multiple threads (you can also just use X processes though) and there are probably a lot of other ways to improve it. For this experiment it was good and fast enough to get the job done :)</p>
Fast arrow functions in Firefox 312014-04-11T00:00:00+00:002014-04-11T00:00:00+00:00https://jandemooij.nl/blog/fast-arrow-functions-in-firefox-31/<p>Last week I spent some time optimizing ES6 arrow functions. Arrow functions allow you to write function expressions like this:</p>
<pre data-lang="js" style="background-color:#282c34;color:#abb2bf;" class="language-js "><code class="language-js" data-lang="js"><span style="color:#e06c75;">a</span><span>.</span><span style="color:#61afef;">map</span><span>(</span><span style="color:#e06c75;">s </span><span style="color:#c678dd;">=> </span><span style="color:#e06c75;">s</span><span>.length);
</span></code></pre>
<p>Instead of the much more verbose:</p>
<pre data-lang="js" style="background-color:#282c34;color:#abb2bf;" class="language-js "><code class="language-js" data-lang="js"><span style="color:#e06c75;">a</span><span>.</span><span style="color:#61afef;">map</span><span>(</span><span style="color:#c678dd;">function</span><span>(</span><span style="color:#e06c75;">s</span><span>){ </span><span style="color:#c678dd;">return </span><span style="color:#e06c75;">s</span><span>.length });
</span></code></pre>
<p>Arrow functions are not just syntactic sugar though, they also bind their this-value lexically. This means that, unlike normal functions, arrow functions use the same this-value as the script in which they are defined. See the <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Functions/Arrow_functions#Lexical_this">documentation</a> for more info.</p>
<p>Firefox has had support for arrow functions since Firefox 22, but they used to be slower than normal functions for two reasons:</p>
<ol>
<li><strong>Bound functions</strong>: SpiderMonkey used to do the equivalent of |arrow.bind(this)| whenever it evaluated an arrow expression. This made arrow functions slower than normal functions because calls to bound functions are currently not optimized or inlined in the JITs. It also used more memory because we’d allocate two function objects instead of one for arrow expressions.<br />
In <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=989204">bug 989204</a> I changed this so that we treat arrow functions exactly like normal function expressions, except that we also store the lexical this-value in an extended function slot. Then, whenever this is used inside the arrow function, we get it from the function’s extended slot. This means that arrow functions behave a lot more like normal functions now. For instance, the JITs will optimize calls to them and they can be inlined.</li>
<li><strong>Ion compilation</strong>: IonMonkey could not compile scripts containing arrow functions. I fixed this in <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=988993">bug 988993</a>.</li>
</ol>
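<p>Observable behavior is the same either way; the difference is in allocation and call overhead. Conceptually, the old strategy behaved like the explicit <code>.bind</code> below (this is a sketch of the semantics, not SpiderMonkey internals):</p>

```js
// Old approach, roughly: allocate an inner function plus a bound
// wrapper. Calls to the bound function were not inlined by the JITs.
function makeOld() {
  var inner = function () { return this.x; };
  return inner.bind(this); // two function objects per evaluation
}

// New approach, semantically: a single arrow function whose captured
// this-value the engine stores alongside the function (in an extended
// slot), so calls can be optimized like normal function calls.
function makeNew() {
  return () => this.x;
}

var obj = { x: 42 };
console.log(makeOld.call(obj)()); // 42
console.log(makeNew.call(obj)()); // 42
```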
<p>With these changes, arrow functions are about as fast as normal functions. I verified this with the following micro-benchmark:</p>
<pre data-lang="js" style="background-color:#282c34;color:#abb2bf;" class="language-js "><code class="language-js" data-lang="js"><span style="color:#c678dd;">function </span><span style="color:#61afef;">test</span><span>(</span><span style="color:#e06c75;">arr</span><span>) {
</span><span> </span><span style="color:#c678dd;">var </span><span style="color:#e06c75;">t </span><span>= new Date;
</span><span> </span><span style="color:#e06c75;">arr</span><span>.</span><span style="color:#61afef;">reduce</span><span>((</span><span style="color:#e06c75;">prev</span><span>, </span><span style="color:#e06c75;">cur</span><span>) </span><span style="color:#c678dd;">=> </span><span style="color:#e06c75;">prev </span><span>+ </span><span style="color:#e06c75;">cur</span><span>);
</span><span> </span><span style="color:#61afef;">alert</span><span>(new Date - </span><span style="color:#e06c75;">t</span><span>);
</span><span>}
</span><span style="color:#c678dd;">var </span><span style="color:#e06c75;">arr </span><span>= [];
</span><span style="color:#c678dd;">for </span><span>(</span><span style="color:#c678dd;">var </span><span style="color:#e06c75;">i</span><span>=</span><span style="color:#d19a66;">0</span><span>; </span><span style="color:#e06c75;">i</span><span><</span><span style="color:#d19a66;">10000000</span><span>; </span><span style="color:#e06c75;">i</span><span>++) {
</span><span> </span><span style="color:#e06c75;">arr</span><span>.</span><span style="color:#56b6c2;">push</span><span>(</span><span style="color:#d19a66;">3</span><span>);
</span><span>}
</span><span style="color:#61afef;">test</span><span>(</span><span style="color:#e06c75;">arr</span><span>);
</span></code></pre>
<p>I compared a nightly build from April 1st to today’s nightly and got the following results:
<img src="/img/arrow-function-speedup.png" alt="" /></p>
<p>We’re 64x faster because Ion is now able to inline the arrow function directly without going through relatively slow bound function code on every call.</p>
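<p>For reference, here is the normal-function equivalent of the benchmark body next to the arrow version; with the changes above the two should perform comparably (a smaller array is used here just to keep the sketch quick to run):</p>

```js
// Same reduction written both ways; results are identical and,
// after these optimizations, so is the performance profile.
function testArrow(arr) {
  return arr.reduce((prev, cur) => prev + cur);
}
function testNormal(arr) {
  return arr.reduce(function (prev, cur) { return prev + cur; });
}

var nums = [];
for (var i = 0; i < 1000; i++) nums.push(3);
console.log(testArrow(nums) === testNormal(nums)); // true
```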
<p>Other browsers don’t support arrow functions yet, so they are not used a lot on the web, but it’s important to offer good performance for new features if we want people to start using them. Also, Firefox frontend developers love arrow functions (grepping for “=>” in browser/ shows hundreds of them), so these changes should also help the browser itself :)</p>
Hello world!2014-01-11T00:00:00+00:002014-01-11T00:00:00+00:00https://jandemooij.nl/blog/hello-world/<p>Welcome to <del>WordPress</del> <del>Jekyll</del> Zola. This is your first post. Edit or delete it, then start blogging!</p>