large number of Linux machines (>500) and hit quite a few problems:
1. Seeding: initially, because they started more or less at the same time,
a lot of the slaves ended up with an identical seed, which IIUC slows
Monte-Carlo convergence an awful lot.
To fix this, I had to start slaves staggered in time with a 2 second
delay because the seeding scheme seems to be using a timer with
a resolution of 1 second.
This means it takes about 10 min just to get the whole thing up to full
speed. That's five times longer than the render would take if I could
start the whole cluster at once.
It would be nice to either be able to specify seeds on the command
line (e.g. a value from /dev/random) or have the master distribute
the seeds.
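To illustrate the difference, here is a minimal Python sketch (not Indigo code). `clock_seeds` mimics what I assume the current scheme does, i.e. truncating a timer to whole seconds, and `make_seeds` is a hypothetical helper a master or launch script could use to hand out distinct seeds from the OS entropy pool. Both names are made up for the example.

```python
import secrets


def clock_seeds(start_times):
    """Seeds a 1-second-resolution timer would give for these start times (seconds)."""
    return [int(t) for t in start_times]


def make_seeds(count, bits=64):
    """Return `count` distinct random seeds drawn from the OS entropy pool."""
    seeds = set()
    while len(seeds) < count:
        # secrets.randbits reads from the same kind of source as /dev/urandom
        seeds.add(secrets.randbits(bits))
    return list(seeds)


# 500 slaves launched 1 ms apart all fall inside the same second...
launch_times = [1000.0 + i * 0.001 for i in range(500)]
timer_based = clock_seeds(launch_times)
# ...whereas entropy-based seeds are distinct by construction.
entropy_based = make_seeds(500)
```

With something like this, the whole cluster could be started at once and each slave would still get its own seed, assuming Indigo grew a way to accept one.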
2. The master doesn't scale. If you want to run Indigo slaves on more than
500 boxes, the master crashes with a std::alloc() exception around
the ~450th slave connection.
Even if that got fixed, the master seems to keep a TCP connection to
each slave open at all times. This is bound to break past a certain
scale because on most OSes you can only have so many TCP
connections open per process.
First, I would suggest that each slave open a connection to the master
only when it needs to upload, and then tear it down once the upload
completes.
Second, I would suggest adding an 'aggregator' mode to Indigo
that would allow a master to talk to a reasonable number of slaves,
aggregating their results and uploading them to a "super" master. If I
understand the Metropolis algorithm correctly, it's just a matter
of the master uploading a "weighted" image upstream.
Doing this would allow for a hierarchy of aggregators and let
you scale the size of a network of Indigo slaves to thousands
of machines easily.
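A rough Python sketch of what I mean by uploading a "weighted" image, under my own assumptions: each partial result is a flat list of per-pixel averages plus a weight (say, samples per pixel), and combining results is a plain weighted average. I don't know how Indigo actually represents its buffers, so `merge_weighted` is purely illustrative.

```python
def merge_weighted(partials):
    """Combine (image, weight) pairs into a single (image, weight) pair.

    Each image is a flat list of per-pixel averages; weight is the amount
    of sampling behind it. This representation is an assumption.
    """
    total_weight = sum(weight for _, weight in partials)
    if total_weight <= 0:
        raise ValueError("no samples to merge")
    size = len(partials[0][0])
    merged = [0.0] * size
    for image, weight in partials:
        if len(image) != size:
            raise ValueError("image size mismatch")
        for i, value in enumerate(image):
            merged[i] += value * weight
    return [v / total_weight for v in merged], total_weight


# Three slaves with different amounts of work done.
a = ([1.0, 2.0], 1)
b = ([3.0, 4.0], 3)
c = ([5.0, 6.0], 4)

flat = merge_weighted([a, b, c])                          # one master sees all slaves
two_level = merge_weighted([merge_weighted([a, b]), c])   # aggregator, then super master
```

Because the merged result carries its own total weight, an aggregator's output looks just like a slave's output to the next level up, which is what would make a hierarchy possible.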
3. Beyond a certain number of machines, when a slave gets wedged
for one reason or another (e.g. when it loses its connection to the master),
it just hangs instead of dying with an error code.
This makes it kind of hard to automate slave management. I had
to jump through hoops writing a wrapper that monitors each slave's CPU
consumption and kills + restarts it when that goes down to zero
for over a minute.
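For anyone needing the same workaround, the wrapper idea looks roughly like this in Python. It is a sketch, not the script I actually ran: it assumes Linux (it reads `/proc/<pid>/stat`), the 60 second timeout and 5 second poll interval are arbitrary, and `command` stands for whatever command line starts a slave on your setup.

```python
import subprocess
import time


def cpu_ticks(pid):
    """Total user + system CPU ticks used by a process, read from Linux /proc."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # Skip past the "(command name)" field, which may itself contain spaces.
    # The remaining fields start at field 3, so utime/stime (14/15) are 11/12.
    fields = data[data.rindex(")") + 2:].split()
    return int(fields[11]) + int(fields[12])


class StallDetector:
    """Reports a stall once the CPU counter has not moved for `timeout` seconds."""

    def __init__(self, timeout=60.0):
        self.timeout = timeout
        self.last_ticks = None
        self.last_change = None

    def update(self, ticks, now):
        if ticks != self.last_ticks:
            # CPU time advanced (or first sample): remember when.
            self.last_ticks = ticks
            self.last_change = now
            return False
        return now - self.last_change >= self.timeout


def supervise(command, timeout=60.0, poll=5.0):
    """Run `command` forever, restarting it when it exits or stops using CPU."""
    while True:
        proc = subprocess.Popen(command)
        detector = StallDetector(timeout)
        while proc.poll() is None:
            if detector.update(cpu_ticks(proc.pid), time.monotonic()):
                proc.kill()   # wedged: kill it and fall through to restart
                proc.wait()
                break
            time.sleep(poll)
```

None of this would be needed if a wedged slave simply exited with a non-zero error code, since a plain restart loop would then be enough.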
4. I am relatively new to Indigo, so apologies if I have missed this
feature, but it would be nice if the master could write checkpoints
to disk. For very long renders, a number of things can go wrong
(power failure, etc.). It'd be nice to be able to restart where
you left off (again, if I understand how Indigo works, it is just a
matter of saving the current image with its "weight" and being able
to reload the whole thing on restart).
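Something along these lines is what I have in mind, sketched in Python. The file format (JSON holding a flat pixel list and a weight) and the function names are made up for illustration; the one real point is writing to a temporary file and renaming it, so that a power failure in the middle of a save cannot destroy the previous checkpoint.

```python
import json
import os
import tempfile


def save_checkpoint(path, image, weight):
    """Atomically write the accumulated image and its weight to `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump({"weight": weight, "image": image}, f)
            f.flush()
            os.fsync(f.fileno())   # make sure the data is on disk before renaming
        os.replace(tmp_path, path)  # atomic: readers never see a half-written file
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return (image, weight), or None if no checkpoint has been written yet."""
    try:
        with open(path) as f:
            state = json.load(f)
    except FileNotFoundError:
        return None
    return state["image"], state["weight"]
```

On restart the master would call `load_checkpoint` and, if it returns something, keep accumulating new slave results on top of the saved image and weight instead of starting from zero.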
Voila, my $.02 worth of gripes with Indigo.
Other than that, kudos to Ono for putting Indigo
together, it's a real nice tool.
PS: For kicks, here's what I rendered at ~16k samples/pixel this
afternoon using 400 quad-core boxes:
http://www.indigorenderer.com/joomla/in ... emId=29574
It took that much oomph because the scene is about as bad as it
gets for spatial acceleration structures.
