[REQ] Network rendering aggregators

mgix · Post by **mgix** » Thu Jun 26, 2008 11:31 am

This afternoon, I tried to launch an indigo network render on a relatively
large number of Linux machines (>500) and hit quite a few problems:

1. Seeding: initially, because they started more or less at the same time,
a lot of the slaves ended up with an identical seed, which IIUC slows
Monte-Carlo convergence an awful lot.

To fix this, I had to start slaves staggered in time with a 2 second
delay because the seeding scheme seems to be using a timer with
a resolution of 1 second.

This means it takes about 10 minutes just to get the whole thing up to
full speed. That's five times longer than the render would take if I
could start the whole cluster at once.

It would be nice to either be able to specify seeds on the command
line (e.g. a value from /dev/random) or have the master distribute
the seeds.

2. The master doesn't scale. If you want to run Indigo slaves on more than
500 boxes, the master crashes with a std::bad_alloc exception around
the ~450th slave connection.

Even if that got fixed, the master seems to keep a TCP connection to
each slave open at all times. This is bound to break past a certain
scale because on most OSes you can only have so many TCP
connections open per process.

First, I would suggest that the slaves open connections to the master
only when they need to upload, and then tear it down once the upload
completes.

Second, I would suggest adding an 'aggregator' mode to Indigo
that would allow a master to talk to a reasonable number of slaves,
aggregating their results and uploading them to a "super" master. If I
understand the Metropolis algorithm correctly, it's just a matter
of the master uploading a "weighted" image upstream.

Doing this would allow for a hierarchy of aggregators and let
you scale the size of a network of Indigo slaves to thousands
of machines easily.

3. Beyond a certain number of machines, when a slave gets wedged
for one reason or another (e.g. when it loses connection to the master),
it just hangs instead of dying with an error code.

This makes it kind of hard to automate slave management. I had
to jump through hoops writing a wrapper that monitors each slave's CPU
consumption and kills and restarts it when that goes down to zero
for over a minute.
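
A wrapper of the kind described above could look roughly like the sketch below. It is not the wrapper the poster wrote: it assumes Linux (it reads `/proc/<pid>/stat`), the function names and the 10 s poll / 60 s stall thresholds are made up for illustration, and the slave command line is whatever you pass on the wrapper's own command line.

```python
import subprocess
import sys
import time


def cpu_ticks(stat_text):
    """Return utime + stime (in clock ticks) from the text of /proc/<pid>/stat."""
    # Field 2 (the command name) is parenthesised and may contain spaces,
    # so only parse what follows the last ')'.
    fields = stat_text.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(fields[11]) + int(fields[12])


def is_stalled(tick_history, window):
    """True if no CPU time was consumed over the last `window` polling intervals."""
    if len(tick_history) <= window:
        return False
    recent = tick_history[-(window + 1):]
    return recent[-1] == recent[0]


def supervise(command, poll_seconds=10, stall_seconds=60):
    """Run `command` forever, killing and restarting it whenever it stops using CPU."""
    window = max(1, stall_seconds // poll_seconds)
    while True:  # note: also restarts after a normal exit
        proc = subprocess.Popen(command)
        history = []
        while proc.poll() is None:
            time.sleep(poll_seconds)
            try:
                with open(f"/proc/{proc.pid}/stat") as f:
                    history.append(cpu_ticks(f.read()))
            except OSError:
                break  # defensive: the process entry could not be read
            if is_stalled(history, window):
                proc.kill()
                break
        proc.wait()


if __name__ == "__main__" and len(sys.argv) > 1:
    # Everything after the script name is treated as the slave command line.
    supervise(sys.argv[1:])
```

Keeping the stall test in a pure function (`is_stalled`) makes the policy easy to check without starting any processes.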

4. I am relatively new to Indigo, so apologies if I have missed this
feature, but it would be nice if the master could write checkpoints
to disk. For very long renders, a number of things can go wrong
(power failure, etc.). It'd be nice to be able to restart where
you left off (again, if I understand how Indigo works, it is just a
matter of saving the current image with its "weight" and being able
to reload the whole thing on restart).
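
As a generic illustration of the checkpoint idea (saving the accumulated image together with its weight, then reloading both), here is a minimal sketch. It says nothing about how Indigo itself stores its state: the JSON layout, the function names and the `num_samples` field are assumptions chosen only to keep the example self-contained.

```python
import json
import os
import tempfile


def save_checkpoint(path, pixels, num_samples):
    """Write the accumulated buffer and its weight so that a crash mid-write
    cannot leave a truncated checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump({"num_samples": num_samples, "pixels": pixels}, f)
            f.flush()
            os.fsync(f.fileno())
        # Rename over the old checkpoint; atomic within one filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return (pixels, num_samples) from a file written by save_checkpoint."""
    with open(path) as f:
        state = json.load(f)
    return state["pixels"], state["num_samples"]
```

Writing to a temporary file and renaming it means a power failure leaves either the previous checkpoint or the new one on disk, never a half-written file.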

Voila, my $.02 worth of gripes with Indigo

Other than that, kudos to Ono for putting Indigo
together, it's a real nice tool.

PS: For kicks, here's what I rendered at ~16k samples/pixel this
afternoon using 400 quad-core boxes:

http://www.indigorenderer.com/joomla/in ... emId=29574

It took that much oomph because the scene is about as bad as it
gets for spatial acceleration structures.

zsouthboy · Post by **zsouthboy** » Thu Jun 26, 2008 11:51 am

Holy crap that's a lot of slaves!

No one's really run into the issues you have before because... no one has network rendered on that scale with Indigo before (perhaps the RANCH has now, though)

I don't have anything to actually add to help you, though

Post by **OnoSendai** » Thu Jun 26, 2008 11:52 am

Hi,

1)
from 1.1.1 changelog:
* added -seed for setting the RNG seed, added RNG seed transmission for network rendering.
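
With a seed option available, one way to avoid the clock-based collisions is to draw every seed from the operating system's entropy source ahead of time. The sketch below only generates the values; how they are handed to each slave is left out, and the 32-bit range is an assumption, since the thread does not say what range `-seed` accepts.

```python
import secrets


def make_seeds(num_slaves, bits=32):
    """Return `num_slaves` distinct random seeds drawn from the OS entropy pool.

    The 32-bit default is an assumption, not a documented limit.
    """
    seeds = set()
    while len(seeds) < num_slaves:
        seeds.add(secrets.randbits(bits))
    return sorted(seeds)


if __name__ == "__main__":
    # One line per slave: index and the seed to pass to it.
    for index, seed in enumerate(make_seeds(500)):
        print(index, seed)
```

Because the seeds are collected in a set, all slaves can be started at the same instant without any two sharing a seed.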

2) I think past a certain number of nodes, the simple master/slave architecture is bound to fail, due to the single point of failure / bottleneck that is the master.

The 'reduce' part of the network rendering problem is basically that of merging frame buffers (or IGI's, as they're known on disk).
The merging process is rather simple, basically it's a matter of adding all the frames together. Since addition is associative and commutative, the merging can be done in any order, and partial merges can be used.

Perhaps an optimal network/merging topology would be an n-ary tree, where each interior node merges the frames from its N children, and passes the merged frame up to its parent.
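
The tree-shaped merge can be sketched in a few lines. This is an illustration of the topology only: frames are modelled as flat lists of numbers rather than real IGI files, and `tree_merge` and `fan_in` are hypothetical names.

```python
def add_frames(a, b):
    """Element-wise sum of two equally sized frame buffers."""
    if len(a) != len(b):
        raise ValueError("frame sizes differ")
    return [x + y for x, y in zip(a, b)]


def merge_group(frames):
    """What one interior node does: sum the frames of its children."""
    total = list(frames[0])
    for frame in frames[1:]:
        total = add_frames(total, frame)
    return total


def tree_merge(frames, fan_in):
    """Reduce frames level by level; each node merges at most `fan_in` children."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    if not frames:
        raise ValueError("no frames to merge")
    level = list(frames)
    while len(level) > 1:
        level = [merge_group(level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
    return level[0]
```

With 500 slaves and a fan-in of 8, the root only ever handles 8 frames at a time and the tree is 3 levels deep. Note that floating-point addition is only approximately associative, so different tree shapes can differ in the last few bits.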

Note that it's quite possible for you to do the merge process yourself, externally from Indigo, and then use the --tonemap option of Indigo to tonemap the final merged IGI.

4) This is quite possible, make sure you have enabled IGI output.
You can then resume on the master using the -r command line option. (see manual for more details)

mgix · Post by **mgix** » Thu Jun 26, 2008 12:17 pm

OnoSendai wrote:Hi,

1)
from 1.1.1 changelog:
* added -seed for setting the RNG seed, added RNG seed transmission for network rendering.

That's very good news, but there doesn't seem to be
any Linux version newer than 1.0.9 on the site, so no cigar.
(and installing Wine on 500 Linux boxes is _truly_ not an
appealing option). I guess I'll just have to wait

OnoSendai wrote: 2) I think past a certain number of nodes, the simple master/slave architecture is bound to fail, due to the single point of failure / bottleneck that is the master.

True.

OnoSendai wrote: The 'reduce' part of the network rendering problem is basically that of merging frame buffers (or IGI's, as they're known on disk).
The merging process is rather simple, basically it's a matter of adding all the frames together. Since addition is associative and commutative, the merging can be done in any order, and partial merges can be used.

But you still need to normalize (divide by the number of samples) to
produce the final image, right ?

I guess the intensities in the IGI frame buffers are never normalized
and the number of samples used to create a given IGI is kept in the file,
and that to combine two IGIs, you add both the pixel intensities and the
number of samples in each IGI. Neat.

How many bits (32, 64, 128) and what format (fixed point, float) do you
use to store intensity values? (Just want to know how far I can push it.)

OnoSendai wrote: Perhaps an optimal network/merging topology would be an n-ary tree, where each interior node merges the frames from its N children, and passes the merged frame up to its parent.

That's what my suggestion was, but now that I've realized that
you have the IGI feature, I can cobble that together myself. This
is actually pretty sweet: with the -r thing you describe below, it
gives me both the checkpointing and the scalability.

OnoSendai wrote: Note that it's quite possible for you to do the merge process yourself, externally from Indigo, and then use the --tonemap option of Indigo to tonemap the final merged IGI.

Yep. Thanks for pointing that out, and sorry for not reading the
manual better.

Thanks for the help !

Post by **OnoSendai** » Thu Jun 26, 2008 12:24 pm

You are correct, pixel values are unnormalised. The normalisation constant can be calculated from num_samples, which is kept with the frame buffer.
Normalisation is done at tonemapping time by Indigo.
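
The bookkeeping described here (add the pixel values, add the sample counts, normalise only at the end) can be sketched as follows. The `FrameBuffer` class is an in-memory stand-in, not the IGI layout, and plain division by `num_samples` is an assumption: the thread only says the normalisation constant can be calculated from `num_samples`, not what the exact formula is.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FrameBuffer:
    pixels: List[float]   # unnormalised accumulated values
    num_samples: float    # how many samples went into `pixels`


def merge(a, b):
    """Combine two buffers by adding both the pixels and the sample counts."""
    if len(a.pixels) != len(b.pixels):
        raise ValueError("frame sizes differ")
    summed = [x + y for x, y in zip(a.pixels, b.pixels)]
    return FrameBuffer(summed, a.num_samples + b.num_samples)


def normalise(fb):
    """Assumed normalisation: divide each pixel by the total sample count."""
    if fb.num_samples <= 0:
        raise ValueError("buffer contains no samples")
    return [p / fb.num_samples for p in fb.pixels]
```

Because the sample count travels with the buffer, partial merges stay correctly weighted however they are grouped.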

You can check out the IGI format here:

http://www.indigorenderer.com/joomla/fo ... php?t=1268

No problem about not reading the manual better, --tonemap is too new to be in the manual (The manual only covers up to 1.0 stable)
