Friday, April 5, 2013

Don't mux parallel shell output! Introducing Hydra-print.

In this age of multicore-everything (your laptop, your phone, your toaster) there’s no excuse for anything not to run in parallel, right? So what about text-based shell programs? On the one hand, it has always been easy to launch things in parallel, say with bash’s support for backgrounding jobs, or using make -j.


But, on the other hand, I’ve never been happy with the output of make -j. Interleaving lines of output from parallel subprocesses isn’t very user-friendly. (Indeed, cabal -j used to do the same thing, and recently it has switched to simply hiding the output instead.) Thus I’ve written a little utility that dynamically splits the terminal window when the compute job enters a parallel region (video below).

It’s called “hydra-print” and is both a Haskell library and a pair of command-line programs (hydra-view and hydra-head). You can find it at its Github page here and the released package is on Hackage here.

After running cabal install hydra-print, the simplest way to use it is to bring up a server like this:
hydra-view
And then arrange for a script to pipe the output of multiple commands to it in parallel:
(command1 | hydra-head) &
(command2 | hydra-head) &
It’s possible to compose commands’ output sequentially as well, by asking hydra-head to return a named pipe, and deleting it to signal that the stream is finished.
p=`hydra-head -p`
command1 >> $p
command2 >> $p
rm -f $p
Further, it’s often convenient to use hydra-view to launch another command, using it to seed the view with output:
hydra-view -- make -j
The repository contains an example Makefile here, which shows how to adapt a Makefile to use hydra-print without any direct support built into Make. Of course, built-in support is better, and I hope to come up with a patch for cabal-install to use hydra-print soon.

Using the Haskell Library

The library is based on the recent io-streams package, and makes it very easy to send several kinds of Haskell data streams into hydra-print.

import qualified System.IO.Streams as S
import UI.HydraPrint
The two main ways to use the library are to (1) bring up a static interface with a given number of windows, or (2) bring up a dynamic view monitoring a changing collection of streams.

The former can be called with hydraPrintStatic defaultHydraConf (zip names strms). Its type signature is:
hydraPrintStatic :: HydraConf -> [(String, InputStream ByteString)] -> IO ()

As you can see, it takes a fixed number of streams (in a list). A dynamic set of streams is represented as – what else – a stream of streams!
hydraPrint :: HydraConf -> InputStream (String, InputStream ByteString) -> IO ()

In a parallel program one might use a Control.Concurrent.Chan to aggregate streams and send them to hydraPrint:

strmSrc <- chanToInput strmChan
hydraPrint defaultHydraConf strmSrc
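To make that concrete, here is a small, self-contained sketch (my own, not taken from the hydra-print documentation) that feeds two named streams into hydraPrint through a Chan. It assumes the chanToInput that ships with io-streams (System.IO.Streams.Concurrent), which ends the stream of streams when Nothing is written, and it assumes UI.HydraPrint exports defaultHydraConf as used above:

import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.Chan (newChan, writeChan)
import Data.ByteString.Char8 (pack)
import qualified System.IO.Streams as S
import System.IO.Streams.Concurrent (chanToInput)
import UI.HydraPrint (hydraPrint, defaultHydraConf)

main :: IO ()
main = do
  strmChan <- newChan
  -- A worker thread registers named streams as they become available:
  _ <- forkIO $ do
         s1 <- S.fromList (map pack ["worker 1: starting\n", "worker 1: done\n"])
         writeChan strmChan (Just ("worker1", s1))
         threadDelay 500000   -- half a second later, a second stream shows up
         s2 <- S.fromList (map pack ["worker 2: hello\n"])
         writeChan strmChan (Just ("worker2", s2))
         writeChan strmChan Nothing   -- no more streams; let hydraPrint finish
  -- Convert the channel into the stream-of-streams that hydraPrint expects:
  strmSrc <- chanToInput strmChan
  hydraPrint defaultHydraConf strmSrc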

Finally, I think the most obvious improvement to make to this little utility would be to provide the option of a tmux interface instead. This would provide much more functionality and should be pretty straightforward, because tmux itself is quite scriptable. Emacs [client] would be another option.

Enjoy!




Friday, May 4, 2012

How to write hybrid CPU/GPU programs with Haskell


What’s better than programming a GPU with a high-level, Haskell-embedded DSL (domain-specific-language)? Well, perhaps writing portable CPU/GPU programs that utilize both pieces of silicon—with dynamic load-balancing between them—would fit the bill.

This is one of the heterogeneous programming scenarios supported by our new meta-par packages. A draft paper can be found here, which explains the mechanism for building parallel schedulers out of "mix-in" components. In this post, however, we will skip over that and take a look at CPU/GPU programming specifically.

This post assumes familiarity with the monad-par parallel programming library, which is described in this paper.

Getting Started

First, we install the just-released meta-par-accelerate package:
cabal install c2hs
cabal install meta-par-accelerate
If you have CUDA 4.1 installed, then just remove the "-f-cuda" (or remove the entire middle line).  And then we import the following module:
import Control.Monad.Par.Meta.AccSMP
This provides a scheduler for (Accelerate GPU-EDSL) + (monad-par multicore CPU) scheduling. It also re-exports the ParAccelerate type class, which provides the ability to launch GPU computations from within a Par computation.

Next, we also import Accelerate itself so that we can express Acc computations that can run on the GPU:
import Data.Array.Accelerate
import Data.Array.Accelerate.CUDA as CUDA
(By the way, this blog post is an executable literate Haskell file that can be found here.)
Now we are ready to create a trivial Accelerate computation:
triv :: Acc (Scalar Float)
triv = let arr = generate (constant (Z :. (10::Int))) 
                          (\ i -> 3.3 )
       in fold (+) 0 arr
This creates a ten-element one-dimensional array, and then sums up the elements. We could run this directly using CUDA, which would print out Array (Z) [33.0]. That’s just Accelerate’s fancy way of saying "33.0"—a scalar is a zero-dimensional array. Here’s how we invoke Accelerate/CUDA:
runDirect = print (CUDA.run triv)
If we are inside a Par computation, on the other hand, we simply use runAcc or runAccWith:
runWPar1   = print (runPar (runAcc triv))
runWPar2   = print (runPar (runAccWith CUDA.run triv))
The former uses the default Accelerate implementation. The latter specifies which Accelerate implementation to use. After all, there should ultimately be several: OpenCL, CUDA, plus various CPU backends. (In the future, we plan to add the ability to change the default Accelerate backend either at the runPar site, or statically. Stay tuned for that. But for now just use runAccWith.)

The reader might protest that it is possible to use CUDA.run directly within a Par computation, so why go to the extra trouble? The advantage of using runAcc is that it informs the Par scheduler of what’s going on. The scheduler can therefore execute other work on the CPU core that would otherwise be waiting for the GPU. An application could achieve the same effect by creating a dedicated thread to talk to the GPU, but that wouldn’t jibe well with a pure computation (forkIO), and it’s easier to let meta-par handle it anyway.

The second benefit of integrated scheduling is that the scheduler can automatically divide work between the CPU and GPU. Eventually, when there are full-featured, efficient CPU backends for Accelerate, this will happen transparently. For now you need to use unsafeHybrid, described in the next section. Finally, our soon-forthcoming CPU/GPU/Distributed schedulers can make more intelligent decisions if they know where all the calls to GPU computations occur.

Hybrid CPU/GPU Workloads

The meta-par and meta-par-accelerate packages, as currently released, include a generalized work-stealing infrastructure. The relevant point for our purposes here is that the CPU and GPU can each steal work from one another. Work-stealing is by no means the most sophisticated CPU/GPU partitioning strategy on the scene; much has been written on the subject, and the approaches can get quite elaborate (for example, modeling memory transfer time). However, as on regular multicores, work-stealing provides an admirable combination of simplicity and efficacy. For example, if a given program runs much better on either the CPU or the GPU, then that device will end up doing more of the work.

In the current release, we use unsafeHybridWith, documented here, to spawn a task with two separate implementations—one CPU and one GPU—leaving the scheduler to choose between them. Here’s a silly example:
hybrid :: Par (IVar Float)
hybrid = unsafeHybridWith CUDA.run (`indexArray` Z) (return 33.0, triv)
runHybrid = print (runPar (hybrid >>= get))
The call to unsafeHybridWith is passed a task that consists of a separate CPU (return 33.0) and GPU (triv) component. We must guarantee that the two computations yield the same result (hence the "unsafe" moniker). The indexArray bit is a conversion function that converts the result from the GPU computation to a type that matches the result from the CPU alternative.

Typically, you would write a recursive algorithm like the following:
-- | Recursive mergesort that bottoms out to CPU or GPU:
parSort :: Vector Int -> Par (Vector Int)
parSort vec =
  if length vec <= threshold
  then unsafeHybridWith CUDA.run undefined
                        (cpuSort vec, gpuSort vec) >>= get
  else let n      = (length vec) `div` 2
           (l, r) = splitAt n vec
       in do lf <- spawn_ (parSort l)
             r' <-         parSort r
             l' <- get lf
             parMerge l' r'
parMerge :: Vector Int -> Vector Int -> Par (Vector Int)
cpuSort :: Vector Int -> Par (Vector Int)
gpuSort :: Vector Int -> Acc (Array DIM1 Int)
threshold :: Int
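The undefined above stands in for the result-conversion function. A plausible definition (a sketch only, assuming boxed Data.Vector vectors and Accelerate's standard toList conversion) is:

import qualified Data.Array.Accelerate as A
import qualified Data.Vector as V

-- Convert the GPU result (a one-dimensional Accelerate array) into the
-- Vector type returned by the CPU path, so both halves of the hybrid task
-- produce the same type:
accToVector :: A.Array A.DIM1 Int -> V.Vector Int
accToVector = V.fromList . A.toList

With that in hand, the bottom-out case reads unsafeHybridWith CUDA.run accToVector (cpuSort vec, gpuSort vec).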
The user of this API still has to make some tough decisions. Setting the threshold determines the granularity at which work will be farmed out between the CPU and GPU. To achieve effective load balance, the problem must be subdivided into sufficiently small pieces to deal with the slower rate at which the CPU may complete work items. Further, in this particular example the GPU may not have enough memory to process the entire array, meaning that the problem must be subdivided at least until it fits in memory.

On the other hand, subdividing the problem further means more round trips through the GPU, more driver calls, and more synchronization. In this particular example the total volume of data transferred is the same (it’s linear), but for other algorithms that may not be the case.

Notes for Hackers

If you want to work with the github repositories, you need to have GHC 7.4 and the latest cabal-install (0.14.0). You can check everything out here:
git clone git://github.com/simonmar/monad-par.git --recursive
Try make mega-install-gpu if you already have CUDA installed on your machine. Explore the READMEs here and here for more information.

Nightly Testing

The GPU-programming portions of meta-par require the latest and greatest GHC (7.4). Their nightly testing results are on a Jenkins instance here. The Jenkins setup tests Linux and Windows. We test Mac OS regularly on our laptops, but don’t have a Jenkins instance for it yet.

The core meta-par package and the original monad-par package do not depend on any recent GHC additions. Nightly testing results for those packages, on GHC 6.12, 7.0, and 7.2, can be found here.

Because Hackage can’t always build documentation for packages with foreign dependencies, here are some documentation links for the relevant packages:

Thursday, April 26, 2012

Dropbox wiki gone -- Why we little people must Clone the Cloud

Clone the cloud.  It's better for everyone.

With github or with Google Drive this happens transparently.  The user has a full copy of what's on the server.  Github even does this for wiki and web page data, which is great, and some people create static HTML blogs using github and tools like Jekyll.


But unfortunately, most text that's poured into comments, forums, wikis, and blogs seems to have a dubious shelf life.  Case in point: Dropbox took down their wiki a while back, making this link in one of my previous posts dead.  Poof.  In this particular case I kept a copy in the text file I pasted it from, so I am republishing it below.

"Data liberation" doesn't quite fix the problem, either.  Some companies are better than others about providing optional downloads of your data.  Google, in particular, is very good.  But laborious opt-in downloads (that most people don't use) aren't good enough.  Always-on synchronization or mirroring is called for.

I'm optimistic.  With the mass movement towards synchronization-based approaches (SkyDrive, iCloud, Google Drive, etc.) and version control hosting (github, bitbucket), I hope we are moving in the right direction.  And why not?  Everyone wins: 
  • For the cloud service provider it means an extra backup copy that reduces the risk of angry customers if a software failure (or more likely, account security breach) destroys data.  It also provides local caching, which always makes good engineering sense.
  • For the user, you get to access your data on the plane and know that it will live on when the company in question decides to retire the cloud service you've been using.


Monday, March 26, 2012

Potential GSOC: Tweak memory-reuse analysis tools for GHC compatibility

Some program instrumentation and analysis tools are language agnostic. Pin and Valgrind use binary rewriting to instrument an x86 binary on the fly, and thus in theory could be used just as well for a Haskell binary as for one compiled from C. Indeed, if you download Pin from pintool.org, you can use the included open source tools to immediately begin analyzing properties of Haskell workloads -- for example the total instruction mix during execution.

The problem is that aggregate data for an entire execution is rather coarse. It's not correlated temporally with phases of program execution, nor are specific measured phenomena related to anything in the Haskell source.

This could be improved. A simple example would be to measure memory-reuse distance (an architecture-independent characterization of locality) but to distinguish garbage collection from normal memory access. It would be quite nice to see a histogram of reuse-distances in which GC accesses appear as a separate layer (different color) from normal accesses.

How to go about this? Fortunately, the existing MICA pintool builds today (v0.4) and can measure memory-reuse distances. In fact, it already produces per-phase measurements, where phases are delimited by dynamic instruction counts (i.e. every 100M instructions). All that remains is to tweak that definition of phase to transition when GC switches on or off.

How to do that? Well, Pin has existing methods for targeted instrumentation of specific C functions. By targeting appropriate functions in the GHC RTS, this analysis tool could probably work without requiring any GHC modification at all.

A further-out goal would be to correlate events observed by the binary rewriting tool with those recorded by GHC's traceEvent.
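For reference, the GHC side of that correlation is already easy to produce: user events written via Debug.Trace end up in the eventlog when the program is compiled with -eventlog and run with +RTS -l. A minimal sketch (the event names here are just placeholders):

import Control.Exception (evaluate)
import Debug.Trace (traceEventIO)

main :: IO ()
main = do
  traceEventIO "BEGIN compute"               -- marks the start of a phase in the eventlog
  _ <- evaluate (sum [1 .. 1000000 :: Int])  -- stand-in for real work
  traceEventIO "END compute"                 -- marks the end of the phase

A binary rewriting tool could then watch for the corresponding RTS calls, or simply align its timestamps against the eventlog, to attribute its measurements to these phases.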

Finally, as it turns out, this would NOT be the first crossing of paths between GHC and binary rewriting. Julian Seward worked on GHC before developing Valgrind:


Wednesday, February 8, 2012

Potential GSoC: Haskell Lock-free Data Structure Implementations

The GHC Haskell compiler recently gained the capability to generate atomic compare-and-swap (CAS) assembly instructions. This opens up a new world of data-structure implementation possibilities.
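To give a flavor of what this enables, here is a minimal sketch of a lock-free (Treiber) stack written against the Data.Atomics interface of the atomic-primops package, which wraps the new CAS support; the exact wrapper API available to you may differ, so treat this as illustrative only:

import Data.IORef (IORef, newIORef)
import Data.Atomics (readForCAS, peekTicket, casIORef)

newtype Stack a = Stack (IORef [a])

newStack :: IO (Stack a)
newStack = fmap Stack (newIORef [])

-- Push by retrying a compare-and-swap on the head pointer until it succeeds:
push :: Stack a -> a -> IO ()
push (Stack ref) x = loop
  where
    loop = do
      tick <- readForCAS ref                  -- snapshot the current head
      let old = peekTicket tick
      (ok, _) <- casIORef ref tick (x : old)  -- attempt to install the new head
      if ok then return () else loop          -- lost the race: retry

-- Pop likewise retries until its CAS succeeds (or the stack is empty):
pop :: Stack a -> IO (Maybe a)
pop (Stack ref) = loop
  where
    loop = do
      tick <- readForCAS ref
      case peekTicket tick of
        []       -> return Nothing
        (x : xs) -> do
          (ok, _) <- casIORef ref tick xs
          if ok then return (Just x) else loop

Even in this tiny example the essential subtlety shows up: between the read and the CAS another thread may have changed the head, so every operation is a retry loop.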

Furthermore, it's an important time for concurrent data structures. Not only is the need great, but the design of concurrent data structures has been a very active area in recent years, as summarized well by Nir Shavit in this article.

Because Haskell objects containing pointers can't efficiently be stored outside the Haskell heap, it is necessary to reimplement these data structures for Haskell, rather than use the FFI to access external implementations. There are already a couple of data structures implemented in the following library (queues and deques):


But, this leaves many others, such as:
  • Concurrent Bags
  • Concurrent Hashtables
  • Concurrent Priority Queues
A good point of reference would be the libcds collection of concurrent data structures for C++ (or those that come with Java or .NET):

One of the things that makes implementing these data structures fun is that they have very short algorithmic descriptions but a high density of thought-provoking complexity.

A good GSoC project would be to implement 2-3 data structures from the literature, and benchmark them against libcds implementations.

UPDATE: Here's a link to the Trac ticket.





Tuesday, December 27, 2011

Wuala, SpiderOak, and Dropbox: Feature Summary and a Little Testing

Dropbox, SpiderOak, and Wuala seem to be the current contenders for someone who wants synchronized cloud storage together with Linux/Mac support. I've tried all three and come up with a few observations that may be of use.

First I did a little test of Dropbox, SpiderOak, and Wuala. I wrote to a text file on one machine inside my LAN and timed how long it took for the file to appear on the other machine. It took 5, 24, and 68 seconds, respectively.

These times are not surprising given that Dropbox implements a "LAN sync" capability, whereas SpiderOak goes through the central server, and Wuala not only goes through the server (in Europe; I'm in Indiana) but also does not use inotify, i.e., it has to periodically scan the file system for changes.

But there are some major reasons to choose SpiderOak or Wuala over Dropbox. One of them is that both SpiderOak and Wuala have client-side encryption such that their employees shouldn't be able to access your files.

Further, there's the handling of symbolic links, which I have complained about before ***. Wuala syncs them. SpiderOak ignores them (which is still better than Dropbox, which follows them resulting in weird behavior).

Wuala also has some unique and desirable features, as well as some major technical drawbacks.
  1. Wuala used to allow a P2P mode in which users traded space on their own machines to become part of the storage network. A very neat concept and a good way to get a lot of cloud storage at a reasonable price. (By contrast, 3TB of storage on Amazon, to match one external HD, is $5000 a year).
  2. Wuala has a time travel view that lets you see the previous state of an entire folder. Dropbox and SpiderOak have a single-file fixation: you can view previous versions of an individual file using their respective UIs. This is great for, say, a Word document, but very poor for a directory containing code or multiple LaTeX files.
  3. FUSE support. Wuala allows the remote drive to be directly mounted via FUSE without requiring everything to be sync'd locally. In theory this would seem to combine the benefits of something like Dropbox with traditional network file systems like NFS and AFS.
And then the drawbacks. Unfortunately these are drawbacks that strike at the core -- the bread and butter of syncing. First, as mentioned above, Wuala doesn't use inotify to let the OS alert it to changed files. Second, Wuala doesn't allow exclusion of files based on name or extension -- a major drawback it shares with Dropbox, and one which makes it inefficient to rebuild compiled projects inside a synced folder. (Note: Wuala also used to not support incremental/differential uploads. That has since been implemented.)

In summary:

Wuala
  • + Syncs symbolic links
  • + Time-travel view of entire folders
  • + FUSE mount without a full local sync
  • - No inotify (periodically scans for changes)
  • - No excluding files by name or extension
SpiderOak
  • + Ignores symbolic links (rather than doing something terrible)
  • + Exclude files from sync via extension or regex
Dropbox
  • + LAN sync
  • ? More geographic coverage (Amazon)

*** P.S. In an ironic act of data non-preservation, it looks like the workarounds I'd posted to the Dropbox wiki (http://wiki.dropbox.com/Symlink%20Semantics%20and%20Workarounds) were lost when they took down the entire wiki.


Thursday, February 24, 2011

Dropbox - another way to shoot your foot

Whew, the data loss in the last post actually wasn't that bad because the folder was version controlled anyway (and stored on another server). I thought only using dropbox for working copies would keep me safe.

So much for that idea! In a sleepy state just now I managed to get Dropbox to delete scattered files outside of my Dropbox folder!

This comes back to the symlink semantics, but it can bite you in strange ways. Here's how you can really screw yourself. First, let's say you've got a working copy that stores all your notes and other reference text files that you search often:
   mv foobar ~/dropbox/foobar
Let's further suppose that you are on a Mac or Linux client and foobar contains symbolic links that point to other scattered text files elsewhere in your personal archives that you may want to have at hand (for grepping).

Next, you want to make a major change to foobar, so you back it up, just to be safe:
   cp -a ~/dropbox/foobar ~/dropbox/foobar_bak
A while later you come across this backup (on a different dropbox-linked machine) and delete it:

rm -rf foobar_bak

BAM! That's it, you just screwed yourself.

How? When Dropbox went back to your original Mac/Linux machine to delete foobar_bak, it decided to FOLLOW THE SYMLINKS. That's right, it deleted the original files. Even though the original foobar directory is still there, its links are now broken, of course.

The whole point of these links was organizational. They pointed all over the place. Even if you have backups you now have to track them down and restore those scattered files. (Which is what I just spent my time doing.) I guess I'm a synchronization masochist because I seem to ask for this kind of punishment.

Bottom line: if you're a power user, avoid Dropbox or keep your usage very light, e.g. stick an office document in there every once in a while. Or be ready to suffer.