Tuesday, December 27, 2011

Wuala, SpiderOak, and Dropbox: Feature Summary and a Little Testing

Dropbox, SpiderOak, and Wuala seem to be the current contenders for someone who wants synchronized cloud storage together with Linux/Mac support. I've tried all three and come up with a few observations that may be of use.

First I did a little test of Dropbox, SpiderOak, and Wuala. I wrote to a text file on one machine inside my LAN and timed how long it took for the file to appear on the other machine. It took 5, 24, and 68 seconds respectively.
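The test itself was nothing fancy; something along these lines does the job (the file name is just a placeholder, and the GNU "stat" invocation shown is for Linux):
   # machine A: modify a file inside the synced folder
   echo "ping $(date +%s)" > ~/Dropbox/sync-test.txt
   # machine B (started just beforehand): wait for the file to change
   # (on a Mac, use "stat -f %m" instead of "stat -c %Y")
   before=$(stat -c %Y ~/Dropbox/sync-test.txt)
   start=$(date +%s)
   while [ "$(stat -c %Y ~/Dropbox/sync-test.txt)" = "$before" ]; do sleep 1; done
   echo "propagation took $(( $(date +%s) - start )) seconds"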

These times are not surprising: Dropbox implements a "LAN sync" capability, whereas SpiderOak goes through its central server, and Wuala not only goes through a server (in Europe; I'm in Indiana) but also doesn't use inotify, i.e. it has to periodically scan the file system for changes.
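(For reference, inotify-style change detection is what the inotifywait tool from inotify-tools exposes on Linux; a client that hooks into it hears about changes immediately, whereas one without it has to rescan the whole tree on a timer. The path below is just an example.)
   # print an event line the instant anything in the tree changes
   inotifywait -m -r -e modify,create,delete,move ~/Dropbox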

But there are some major reasons to choose SpiderOak or Wuala over Dropbox. One of them is that both SpiderOak and Wuala have client-side encryption, such that their employees shouldn't be able to access your files.

Further, there's the handling of symbolic links, which I have complained about before ***. Wuala syncs them. SpiderOak ignores them (which is still better than Dropbox, which follows them, resulting in weird behavior).
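(If you want to know ahead of time how a particular folder will fare, it's worth listing the symlinks in it before handing it to any of these clients; the path here is illustrative:)
   # list every symbolic link in the tree you're about to sync
   find ~/notes -type l -ls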

Wuala has some other unique, desirable features, as well as some major technical drawbacks.
  1. Wuala used to allow a P2P mode in which users traded space on their own machines to become part of the storage network. A very neat concept and a good way to get a lot of cloud storage at a reasonable price. (By contrast, 3TB of storage on Amazon, to match one external HD, is $5000 a year).
  2. Wuala has a time-travel view that lets you see the previous state of an entire folder. Dropbox and SpiderOak have a single-file fixation: you can view previous versions of an individual file using their respective UIs. This is great for, say, a Word document, but very poor for a directory containing code or multiple LaTeX files.
  3. FUSE support. Wuala allows the remote drive to be directly mounted via FUSE without requiring everything to be sync'd locally. In theory this would seem to combine the benefits of something like Dropbox with traditional network file systems like NFS and AFS.
And then the drawbacks. Unfortunately these are drawbacks that strike at the core -- the bread and butter of syncing. First, as mentioned above, Wuala doesn't use inotify to let the OS alert it to changed files. Second, Wuala doesn't allow exclusion of files based on name or extension -- a major drawback it shares with Dropbox, and one that makes it inefficient to rebuild compiled projects inside a synced folder. (Note: Wuala also used to not support incremental/differential uploads. That has since been implemented.)
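To get a feel for the cost of the missing exclusions, count the build products sitting in a synced working copy -- with no way to exclude them, every one of them churns through the uploader after each rebuild (paths and extensions are only examples):
   find ~/Dropbox/myproject \( -name '*.o' -o -name '*.hi' \) | wc -l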

In summary:

Wuala
  • + Syncs symbolic links
  • + Folder-level "time travel" history
  • + FUSE mount of the remote drive (no full local sync required)
SpiderOak
  • + Ignores symbolic links (rather than doing something terrible)
  • + Exclude files from sync via extension or regex
Dropbox
  • + LAN sync
  • ? More geographic coverage (Amazon)

*** P.S. In an ironic act of data non-preservation, it looks like the work-arounds I'd posted to the Dropbox wiki (http://wiki.dropbox.com/Symlink%20Semantics%20and%20Workarounds) were lost because they took down the entire wiki.


Thursday, February 24, 2011

Dropbox - another way to shoot your foot

Whew, the data loss in the last post actually wasn't that bad, because the folder was version controlled anyway (and stored on another server). I thought only using Dropbox for working copies would keep me safe.

So much for that idea! In a sleepy state just now I managed to get Dropbox to delete scattered files outside of my Dropbox folder!

This comes back to the symlink semantics, but it can bite you in strange ways. Here's how you can really screw yourself. First, let's say you've got a working copy that stores all your notes and other reference text files that you search often:
   mv foobar ~/dropbox/foobar
Let's further suppose that you are on a Mac or Linux client and foobar contains symbolic links that point to other scattered text files elsewhere in your personal archives that you may want to have at hand (for grepping).

Next, you want to make a major change to foobar, so you back it up, just to be safe:
   cp -a ~/dropbox/foobar ~/dropbox/foobar_bak
A while later you come across this backup (on a different Dropbox-linked machine) and delete it:
   rm -rf foobar_bak

BAM! That's it, you just screwed yourself.

How? When Dropbox went back to your original Mac/Linux machine to delete foobar_bak, it decided to FOLLOW THE SYMLINKS. That's right, it deleted the original files. Even though the original foobar directory is still there, its links are now broken, of course.
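For contrast, the normal local semantics here are completely safe -- neither cp -a nor rm -rf dereferences a symlink. A quick demo with throwaway paths:
   mkdir -p /tmp/demo/target && echo "precious" > /tmp/demo/target/file.txt
   mkdir -p /tmp/demo/foobar && ln -s /tmp/demo/target/file.txt /tmp/demo/foobar/link
   cp -a /tmp/demo/foobar /tmp/demo/foobar_bak    # the backup keeps 'link' as a link
   rm -rf /tmp/demo/foobar_bak                    # removes only the link, not its target
   cat /tmp/demo/target/file.txt                  # still prints "precious"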

The whole point of these links was organizational. They pointed all over the place. Even if you have backups, you now have to track them down and restore those scattered files. (Which is what I just spent my time doing.) I guess I'm a synchronization masochist, because I seem to ask for this kind of punishment.

Bottom line: if you're a power user, avoid Dropbox or keep your usage very light, e.g. stick an office document in there every once in a while. Or be ready to suffer.

A New Dropbox Nightmare


I was cautiously optimistic and hopeful -- "maybe Dropbox is the synchronization solution I've been waiting for" -- but that sentiment is quickly being replaced with a jaded one.

First, it is good to recognize that backup and synchronization (mirroring) are orthogonal. Mirroring only ensures that mistakes get propagated. Well, I just experienced my second spontaneous Dropbox deletion event, which, due to the miracle of synchronization, was applied to all of my computers and to the cloud.

What can be worse than a silent deletion event for a service like Dropbox? Nothing that I can think of. I can't babysit the 96,000 files in my Dropbox to catch something like this.
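(The best I can do is periodically diff the Dropbox tree against an independent snapshot and look for things that have silently vanished; the paths here are illustrative:)
   # files present only in the snapshot are files that disappeared from Dropbox
   diff -rq ~/Dropbox ~/backups/dropbox-snapshot | grep '^Only in .*snapshot'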

Now the details. I came across a directory in which 95% of the files had been deleted (leaving a random few remaining). Scrolling back through the event history, I found that this happened after linking a new client, which happened to be a Windows virtual machine. Below is a snippet of the event audit trail. The entire series of events at 10:09 was spurious (including deleting 411 files and 206 folders).
You edited the file .saves-25811-rnmacbook13~.      11:27 AM
You renamed the file intel_aes.c to test038.h.ERR. 10:09 AM
You renamed the file setup to test047.h.ERR. 10:09 AM
You added 686c9fad58174636354 and 78 more files. 10:09 AM
You edited simple_cycle.out and 49 more files. 10:09 AM
You deleted 6c88b2f3aaaa585db12 and 411 more files. 10:09 AM
You added eb and 6 more folders. 10:09 AM
You deleted intel_aes_lib and 206 more folders. 10:09 AM
You moved 04 and 24 more folders. 10:09 AM
You added the folder fabl. 12:42 AM
The computer rn-waspvm was linked to your account 12:24 AM
You edited the file ghc_setup.txt. 12:24 AM

A couple of additional oddities show up in the above:
  • It waited over nine hours after the new client was linked before going crazy.
  • Files were "renamed", e.g. intel_aes.c to test038.h.ERR. Huh? Is that a delete being paired with an add to create a fake rename? Possibly a secondary bug.
I went back and checked out the state of that Windows client just now. A clue! The only files it successfully downloaded were the ones not deleted in the above nonsense. So Dropbox confused a partially downloaded directory with a new version of the directory (deleting most of the contents). Unison has never done anything like that to me in ten years of heavy use!

Perhaps corroborating this: the Windows client currently thinks it "hasn't connected yet" [Screenshot below]. Oh really?


Well, perhaps this is "fool me twice, shame on me". I should have known better -- my wife tried Dropbox about a year ago, dumped her ~50 gigabyte personal archive in all at once, then moved a folder, and ended up with her data in a never-fully-synchronized, confused state. It still hasn't been completely fixed; there's a backup, but also new stuff mixed into the messed-up Dropbox version. I chalked that one up to bad behavior while under load and before the first sync had succeeded (plus the dangers of syncing being too implicit and automatic). Still, I wanted to give Dropbox another chance, trying it myself this time and thinking that:
  • It's surely improving and ...
  • I would be more cautious, spoon-feed it, and pay more attention to its syncing state.
But clearly that didn't save me.

Tuesday, February 15, 2011

Dropbox semantics - oh that there were such a thing

After some recent Dropbox problems I've been having (not the first, unfortunately), I added an entry on the Dropbox wiki explaining what happens if you are so impertinent as to add directories containing symbolic links to your Dropbox folder:


[UPDATE - Dropbox took down their wiki, but I reposted the page here.]

One day maybe we'll have something with the correctness and robustness of Unison and the convenience of Dropbox. People will dance in the streets. I think Dropbox can be that solution if they work at it.
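(The Unison half of that wish is already refreshingly small; a minimal profile looks roughly like this -- the host name and paths are made up:)
   # ~/.unison/notes.prf
   root = /home/me/notes
   root = ssh://myserver//home/me/notes
   ignore = Name {*.o,*.hi,*.aux}
   batch = true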

UPDATE - these issues are explained at greater length in this post:

http://aurelio.net/articles/dropbox-symlinks.html

Thursday, October 14, 2010

Hacking together a working version of Haskell Platform 2010.2 for GHC 7.1

I don't know if GHC developers typically attempt to bring up a full, cabal-install-based setup around development versions of the compiler, but that's what I wanted. To that end, I hacked up a modified version of Haskell Platform 2010.2 that works with GHC 7.1.20101014. This is a recent version of GHC that includes, among other things:
  • Major runtime improvements: namely, a BLACKHOLE fix that greatly improves parallel performance in some cases, relative to 6.12.
  • The LLVM backend.
  • The new type inference engine.
The modified version of Haskell Platform can be downloaded from the link below. You should be able to build it against a recent GHC grabbed from the nightly snapshots (try 2010.10.14 to be safe). If that doesn't work, you can download my own build of the compiler below (Linux, 64-bit), which is inflexible and needs to be unpacked in /opt/. I'm including links to download it with or without the platform pre-installed:
After that you should be ready to install other software via "cabal install".
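That is, roughly (the package named here is just an example):
   ghc --version
   cabal update
   cabal install hscolour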

The modified Haskell Platform tarball contains my notes on the hacks that were necessary, copied below:



I did a few hacks to get this working with GHC 7.1.20101014

The first was to upgrade the MTL package to a slightly newer version,
downloaded from Hackage. Ditto for deepseq.

The second was to hack the .cabal file in haskell-src to be less
picky about upper bounds on version numbers.
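The edit is of this shape (the numbers are illustrative, not the exact
haskell-src bounds):
   -- before: the upper bound rejects the base that ships with GHC 7.1
   build-depends: base >= 4 && < 4.3, pretty, array
   -- after: loosened
   build-depends: base >= 4 && < 5, pretty, array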

But I still ran into compile problems with haskell-src. (And there's
no new version of it released at the moment to upgrade to.) Namely, I
got the following compile error:

Language/Haskell/Syntax.hs:67:8:
Could not find module `Data.Generics.Instances':
Use -v to see a list of the files searched for.

Ok, I included "syb" in the list of packages.

Next, build error on quickcheck... upgraded to 2.3.0.2
But that didn't fix the problem -- "split" is still undefined.

The package 'parallel' had to be loosened up to tolerate newer containers versions.

HTTP complained about wanting base 3... but why was the "old-base"
flag set anyway?

Ditto zlib.

Finally, the cabal-install package also required some relaxation of
version numbers, and, worse, it seems the type of a library
function has changed from a Maybe result to a definite one.

Here I had to make a choice between updating to cabal-install 0.9.2
or hacking the 0.8.2 version. It works to add an "fmap Just" to get
0.8.2 to build, and besides the 0.9.2 version I have is actually just
the darcs head -- there hasn't been a release yet.

After picking the hacked 0.8.2 version, `make` succeeded for my
modified haskell-platform-2010.2.
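For anyone retracing this, the overall build sequence was roughly the
usual one for a platform source tree (the tarball name and prefix here
are illustrative):
   tar xzf haskell-platform-2010.2.0.0-modified.tar.gz
   cd haskell-platform-2010.2.0.0
   ./configure --prefix=$HOME/haskell-platform-ghc7
   make
   make install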

Tuesday, June 29, 2010

Good ol' FP infighting

A Schemer by upbringing, I know all about infighting!

It seems that there's been an ongoing back-and-forth between certain ML and Haskell partisans about performance predictability. Further, there's been the accusation that Haskell has horrible hash table performance. (I haven't run these tests myself yet, but I'd like to.)

Anyway, the Haskell CnC project that I announced on the Intel blogs has been garnering some feedback. It got panned particularly brutally by Jon Harrop here. I don't know, maybe that's just what Jon likes to do. In any case, I wrote a response in the comments and it grew big enough that I thought I'd make it into a post (below).




Response to: http://flyingfrogblog.blogspot.com/2010/06/intel-concurrent-collections-for.html

I think there's a bit of a misunderstanding here in the initial conflation of CnC with "graph reduction". CnC is a programming model based on static graphs (no reduction) like synchronous data flow. (I recommend the papers on StreamIT in this area.)

If anything, you could view this as backing down from some of the traditional Haskell implementation decisions. The user makes decisions as to the granularity of the parallel tasks (the granularity of the nodes in the graph), and CnC eschews lazy evaluation during graph execution.

I guess I feel like I've been caught in the middle of some existing flame wars wrt hash tables and performance predictability. I'm aware of the existing battle over Haskell hash tables. I have no problem with hash tables. The Haskell CnC implementation makes no use of persistence and can use mutable map data structures just as well as immutable ones. I use Haskell hash tables, but since they don't support concurrent update (and aren't high-performance to start with) they're not the default. I'd love better hash tables. In other implementations we use TBB hash maps.

Regarding performance predictability -- sure, it's a problem, both because of lazy evaluation and dependence on high level (read fragile) optimizations. But I don't see why this needs to be an argument. People who need great predictability today can use something else. Haskellers should continue to try to improve predictability and scaling. What else can they do? Even in this little CnC project we have made progress -- complex runtime, quirks and all -- and have Haskell CnC benchmarks getting parallel speedups over 20X. Not perfect, but improving. (I'll give your raytracing benchmark a try sometime, if you like; that sounds fun.)

I also don't think it's a good idea to endorse Dmitriy Vyukov's "kindly" dismissal of all of Haskell and Erlang. That's an even more extreme position than Jon's own "doesn't scale beyond a few cores" position.

Jon, I'm amenable to your arguments and not necessarily in a different camp (I'm an eager functional programmer more than a lazy one), but I would appreciate not being so cursorily panned!

P.S. Regarding "the only problem that we've solved" being Mandelbrot: indeed, that would be bad if it were so. Please see the paper. We don't have nearly as many benchmarks as I'd like yet, but we're doing Cholesky factorization, the Black-Scholes equation, and the N-body problem at least. Also, the paper has some more in-depth discussion of scheduling. By the way, I don't consider any of the current schedulers to be anywhere near an endpoint. Plenty of low-hanging fruit remains.

P.P.S. We also have a CnC implementation in F#.

I guess a blog wouldn't hurt.

I've been posting a bit on the Intel blogs recently. Yet I guess there will always be posts that aren't appropriate in that context, so why not create one of these newfangled weblogs?