Wednesday, May 29, 2013

Towards Cloud-based, Crowd-sourced benchmarking? (HSBencher)

Producing benchmarking plots using text files and gnuplot has always left me disappointed. We never get things as “push button” as we would like. Further, there’s always too much spit and chewing gum involved, and the marginal cost of asking a new question of the data is too high.

Thus I’m pleased to announce “HSBencher”, a package I just released on Hackage, here (and github here), which tackles a small part of this problem. Namely, it handles running benchmarks while varying parameters and then uploading the resulting benchmark data to Google Fusion Tables.

The theory is that it should then be easier to: (1) access the data from any program, anywhere; (2) pass around live graphs that are just a URL; and (3) write scripts to analyze the data for trends. In this post I’ll describe how to use the library, and then I’ll speculate a bit about how it could be used to harvest benchmarking data from more machines via “crowd sourcing”.

But first, a few notes about where HSBencher fits in vis-a-vis other benchmarking frameworks.
  • HSBencher runs coarse granularity benchmarks, where each benchmark run involves launching a new process. This is appropriate for parallelism benchmarks which need to run for a while, and also for inter-language comparisons.
  • By contrast, the popular Criterion package is great for fine-grained Haskell benchmarks.
  • HSBencher is language-agnostic. What it knows about GHC, cabal, and make is encapsulated in plugins called BuildMethods, with the built-in ones here.
  • In contrast, there are plenty of language-specific benchmarking tools. For example, if you are writing JVM based code, you might useJMeter, and then you could even get some support from Jenkins CI. Likewise, for Haskell, Fibon offers some advantages in that it understands GHC and GHC configuration options.
  • Many large benchmark suites just come with simple build systems, such as a series of Makefiles. These can be wrapped to work with HSBencher.


HSBencher Installation

First, getting the library should be a one-liner, but requires GHC >= 7.4:
  cabal install hsbencher -ffusion
If you have problems with the fusion table flag turned on, try using the latest version of handa-gdata from here or here.  Also wipe out your ~/.cabal/googleAuthTokens directory.

(UPDATE: at the moment HSBencher needs handa-gdata-0.6.2, which is pre-release so use the HEAD version.)



HSBencher Usage Model


HSBencher is configured the same way as xmonad, by writing an executable Haskell file and importing the library.

  import HSBencher
  main = defaultMainWithBechmarks
        [ Benchmark "bench1/bench1.cabal" ["1000"] $
          Or [ Set NoMeaning (RuntimeParam "+RTS -qa -RTS")
             , Set NoMeaning (RuntimeEnv "HELLO" "yes") ] ]
The resulting executable or script will also take a number of command line arguments, but the main thing is defining the list of benchmarks, and the configuration space associated with each benchmark.
The configuration spaces are created with a nested series of conjunctions and disjunctions (And and Or constructors). And is used for setting multiple options simultaneously, and Or creates alternatives which will be exhaustively explored. Creating a series of conjoined Ors of course creates a combinatorial explosion of possible configurations to explore.

The Set form takes two things. The latter argument contains strings that are opaque to the benchmarking system. These are runtime/compile-time command-line arguments as well as environment variable settings. Together these strings implement the desired behavior (see the ParamSetting type). But it’s often useful for the benchmarking framework to know what a setting is actually doing in terms of commonly defined concepts like “setting the number of threads”. Hence the first parameter to Set is a machine-readable encoding of this meaning. For example, a meaning of Threads 32 attached to a setting will let the benchmark framework know how to file away the benchamrk result in the Google Fusion Table.

You can run the above script (without Fusion Table upload) using the following:
  runghc minimal.hs
That will leave the output data set in a text file (results_UNAME.dat) with the full log in bench_UNAME.log. For example:
 # TestName Variant NumThreads   MinTime MedianTime MaxTime  Productivity1 Productivity2 Productivity3
 #
 # Tue May 28 23:41:55 EDT 2013
 # Darwin RN-rMBP.local 12.3.0 Darwin Kernel Version 12.3.0: Sun Jan  6 22:37:10 PST 2013; root:xnu-2050.22.13~1/RELEASE_X86_64 x86_64
 # Ran by: rrnewton
 # Determined machine to have 8 hardware threads.
 #
 # Running each test for 1 trial(s).
 # Git_Branch: master
 # Git_Hash: f6e0677353680f4c6670316c9c91ce729f19ed0e
 # Git_Depth: 244
 #  Path registry: fromList []
 bench1          1000                 none     0       0.3 0.3 0.3
 bench1          1000                 none     0       0.3 0.3 0.3
We’ll return to this below and see the fuller schema used by the Fusion table output.

 Setting up Fusion Tables

Because you installed with -ffusion above, you already have the Fusion Tables library.  Unfortunately, the Haskell bindings for Google fusion tables are quite limited at the moment, and not that robust. I’m afraid right now [2013.05.29] you need to create at least one table manually before following the instructions below.

Once you’re done with that, go ahead and go to the Google API Console and click Create project.


Next you’ll want to flip the switch to turn on “Fusion Tables API”. It will look like this:



Then “Create an OAuth 2.0 client ID…”:



And select “Installed Application”:


Finally you’ll see something like this on the “API Access” tab, which lists the Client ID and “secret”:


You can provide those to your benchmarker on the command line, but it’s often easier to stick them in environment variables like this:
export HSBENCHER_GOOGLE_CLIENTID=916481248329.apps.googleusercontent.com
export HSBENCHER_GOOGLE_CLIENTSECRET=e2DVfwqa8lYuTu-1oKg42iKN
After that, you can run the benchmark script again:
runghc minimal.hs --fusion-upload --name=minimal
When that finishes, if you go to drive.google.com (and filter for Tables) you should see a new table called “minimal”. And will see a similar record as above:



The rest of the columns currently included are:
MINTIME_PRODUCTIVITY
MEDIANTIME_PRODUCTIVITY
MAXTIME_PRODUCTIVITY
ALLTIMES
TRIALS
COMPILER
COMPILE_FLAGS
RUNTIME_FLAGS
ENV_VARS
BENCH_VERSION
BENCH_FILE
UNAME
PROCESSOR
TOPOLOGY
GIT_BRANCH
GIT_HASH
GIT_DEPTH
WHO
ETC_ISSUE
LSPCI
FULL_LOG
If you play around with Fusion Tables, you’ll find that you can do basic plotting straight off this data, for example, threads vs. time:

(Beyond 12 is using hyperthreading on this machine.) Filtering support is great, group-by or other aggregation seems absent currently. Thus you’ll ultimately want to have other scripts that ingest the data, analyze it, and republish it in a summarized form.

 Benchmark format and BuildMethods

The Benchmark data structure in the example above specifies a .cabal file as the target, and 1000 as the command line arguments to the resulting executable. (Note that the RunTimeParams also ultimately get passed as command line arguments to the executable.)

The convention in HSBencher is that a path uniquely identifies an individual benchmark; properties of that file (or the surrounding files) determine which build methods applies. For .cabal or .hs files, the extensions activates the appropriate build method.

The interface for build methods is defined here. and the full code of the ghc build method is show below:
ghcMethod = BuildMethod
  { methodName = "ghc"
  , canBuild = WithExtension ".hs"
  , concurrentBuild = True 
  , setThreads = Just $ \ n -> [ CompileParam "-threaded -rtsopts"
                               , RuntimeParam ("+RTS -N"++ show n++" -RTS")]
  , clean = \ pathMap bldid target -> do
     let buildD = "buildoutput_" ++ bldid
     liftIO$ do b <- span=""> doesDirectoryExist buildD
                when b$ removeDirectoryRecursive buildD
     return ()
  , compile = \ pathMap bldid flags target -> do
     let dir  = takeDirectory target
         file = takeBaseName target
         suffix = "_"++bldid
         ghcPath = M.findWithDefault "ghc" "ghc" pathMap
     log$ tag++" Building target with GHC method: "++show target  
     inDirectory dir $ do
       let buildD = "buildoutput_" ++ bldid
       liftIO$ createDirectoryIfMissing True buildD
       let dest = buildD </> file ++ suffix
       runSuccessful " [ghc] " $
         printf "%s %s -outputdir ./%s -o %s %s"
           ghcPath file buildD dest (unwords flags)
       return (StandAloneBinary$ dir </> dest)
  }
 where
  tag = " [ghcMethod] "
That’s it!

In this case we use a custom build directory triggered off the build ID so that concurrent builds of different configurations are possible. More details on the conventions that benchmarks should follow can be found in the README.

 Crowd sourcing?

Performance regression testing, say, for a compiler like GHC, is no easy thing. The surface area is large (libraries, language features), as is the variety of hardware architectures it will run on in the wild. One idea is that if we could make it absolutely trivial for users to run the benchmark suite, then we could aggregate that data in the cloud, and this could provide a variety of combinations of OS, hardware, and libraries.

Unfortunately, the Google Drive permissions model doesn’t support this well right now. Each Fusion Table is either readable or editable by a particular usable. There’s no way to switch one to “append-only” or even make it world-writable, even if one wanted to live dangerously.

Nevertheless, it would be possible to follow a “pull request” kind of model where a particular HSBencher-based client uploads to a personal Fusion table, and then sends a request to a daemon that ingests the data into the global, Aggregated fusion table for that project. We’ll see.

Future work and feedback

For our own use, I’ve converted the monad-par benchmarks, turning them into this script. I have my eye on the GHC nofib tests as another target. If you get a chance to try HSBencher, let me know how it works for you. It will need many more features, I’m sure, and I’d like to hear which your benchmark suite needs.