A SHIFT in Outlook
Ben Segal, IT/PDP
How the all-powerful IBM and Cray mainframes were replaced by nasty distributed Unix boxes.
I don't remember exactly who first proposed running physics batch jobs on a Unix workstation, rather than on the big IBM or Cray mainframes which were doing that sort of production work in 1989 at CERN. The workstation in question was to be an Apollo DN10000, the hottest thing in town with RISC CPUs of a formidable 5 CERN Units each and costing around 150,000 Sfr. for a 4-CPU box. It was the combined idea of Les Robertson, Eric McIntosh, Frederic Hemmer, Jean-Philippe Baud, myself and perhaps some others working at that time around the biggest Unix machine that had ever crossed the threshold of the Computer Centre (a Cray XMP-48, running UNICOS).
At any rate, when we spoke to the Apollo sales people about our idea, they liked it so much that they lent us the biggest box they had, a DN10040 with four CPUs plus a staggering 64 MBytes of memory and 4 Gigabytes of disk. Then, to round it off, they offered to hire a person of our choice for three years to work on the project at CERN.
In January 1990, the machine was installed and our new hire, Erik Jagel, an Apollo expert after his time managing L3's Apollo farm, had coined the name "HOPE" for the new project. (HP had bought Apollo, OPAL had expressed interest, so it was to be the "Hp Opal Physics Environment"). We asked Jean-Claude Juvet where we could find the space to install HOPE in the Computer Centre. We just needed a table with the DN10040 underneath and an Ethernet connection to the Cray - for getting access to the tape data. He replied: "Oh, there's room in the middle" - where the recently obsolete round tape units had been - so that's where HOPE went, looking quite lost in the huge computer room with the IBM complex on one side and the Cray on the other.
Soon the HOPE cycles were starting to flow. The machine was surprisingly reliable, and porting the big physics FORTRAN programs was easier than expected. After around six months the system was generating 25% of all the CPU cycles in the Centre, which became known to the management when we included HOPE's accounting files in the weekly report that plotted these things as nice histograms for easy reading.
We were really encouraged by this success and went to work on a proposal to extend HOPE in a systematic way. The idea was to build a scalable version from interchangeable components: CPU servers, disk servers and tape servers, all connected by a fast network and software to create a distributed mainframe. "Commodity" became the keyword - we would use the cheapest building blocks from those manufacturers who had the best price-performance for each function at any given time.
How big could such a system be, and what would it cost? We asked around and received help from many colleagues for a design study. A simulation was done of the workflow through such a system, bandwidth requirements were estimated for the fast network "backplane" needed to connect everything, the prices were calculated, the required software was sketched out and the manpower for development and operation was predicted.
The software development would be quite a challenge. Fortunately, some of us had been working with Cray at CERN, adding to Unix some missing facilities vital for mainframe computing: a proper batch scheduler and a tape drive reservation system, for example. These facilities could be reused quite easily. Other functions required would be a distributed "stager" and a "disk pool manager". These would allow pre-assembly of each job's tape data, read from drives on tape servers, into efficiently managed disk pools located on disk servers, ready to be accessed by the jobs in the CPU servers. Also new would be "RFIO", a remote file I/O package to offer a unified and optimized data transfer service between all the servers via the backplane, looking like NFS but with much higher efficiency.
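To make the RFIO idea concrete, here is a toy remote file I/O exchange in the same spirit: a client asks a server over TCP for a byte range of a named file, much as a CPU server would pull staged data from a disk server. The wire format, function names and message layout below are invented purely for illustration; they are not the actual RFIO protocol or API.

```python
# Toy remote file I/O sketch (RFIO-like in spirit only; the real RFIO
# protocol and interface differed). A server thread answers one request:
# "send me `count` bytes of file `name` starting at `offset`".
import os
import socket
import struct
import tempfile
import threading

def recvn(conn, n):
    """Read exactly n bytes from a socket (recv may return short reads)."""
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed early")
        buf += chunk
    return buf

def serve_one(srv, root):
    """Accept one connection and answer a single read request."""
    conn, _ = srv.accept()
    with conn:
        # Invented request format: 2-byte name length, the file name,
        # then 8-byte offset and 8-byte byte count (network byte order).
        (nlen,) = struct.unpack("!H", recvn(conn, 2))
        name = recvn(conn, nlen).decode()
        off, cnt = struct.unpack("!QQ", recvn(conn, 16))
        with open(os.path.join(root, name), "rb") as f:
            f.seek(off)
            data = f.read(cnt)
        # Reply: 8-byte payload length, then the payload itself.
        conn.sendall(struct.pack("!Q", len(data)) + data)

def remote_read(addr, name, offset, count):
    """Client side: fetch a byte range of a remote file."""
    with socket.create_connection(addr) as c:
        nm = name.encode()
        c.sendall(struct.pack("!H", len(nm)) + nm
                  + struct.pack("!QQ", offset, count))
        (dlen,) = struct.unpack("!Q", recvn(c, 8))
        return recvn(c, dlen)

# Demonstration: serve a small file from a temporary "disk pool".
root = tempfile.mkdtemp()
with open(os.path.join(root, "event.dat"), "wb") as f:
    f.write(b"HEADERpayload-bytes")

srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
t = threading.Thread(target=serve_one, args=(srv, root))
t.start()
chunk = remote_read(srv.getsockname(), "event.dat", 6, 7)
t.join()
srv.close()
print(chunk)  # b'payload'
```

The real package also had to optimize buffer sizes and transfer paths per network medium, which is where its efficiency advantage over NFS came from.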
Finally a suitable name was coined for the project, by Erik Jagel again: "SHIFT". It stood for "Scalable Heterogeneous Integrated FaciliTy", a bit forced for the final "T" but inescapably suggesting the paradigm shift occurring in large scale computing: away from mainframes and towards a distributed low-cost approach. The "SHIFT" proposal report was finished in July 1990. It carried ten names: those of the colleagues from several groups who had helped with their ideas and worked on the document. It was sent to the CN management and then we waited.
Our surprises began. It seemed that not everyone thought this was a good idea after all. Enquiries were received from Director level: had there really been ten people working on such a scheme? How many precious Cray resources were being used to feed this HOPE gadget - and how many of HOPE's claimed 25% CPU cycles were in fact Cray cycles? In reality, the Cray had been used only as a convenient tape server, because it was the only Unix machine in the Computer Centre with access to the standard tape drives, all physically connected to the IBM mainframe at that time.
The CN Divisional management was more open. We were told the following: if we could persuade at least one of the four LEP experiments to invest in our idea we would have matching support from the Division. The search began. We spoke to ALEPH, but they replied "No thank you, we're quite happy with our all-VAX VMS approach". L3 replied "No thanks, we have all the computing power we need". DELPHI replied "Sorry, we've no time to look at this as we're trying to get our basic system running".
Only OPAL took a serious look. They had already been the partner in HOPE and also had a new collaborator from the University of Indiana with some cash to invest and some SCSI disks for a planned storage enhancement of their existing VMS based system. They would give us these contributions until March 1991, the next LEP startup, on condition that everything was working by then or else we'd have to return their money and their disks. It was September 1990, and there was a lot of work to do.
I believe that our modular approach and use of the Unix, C language, TCP/IP and SCSI standards were the keys to the very short development timescale we achieved. The design studies had included technical evaluations of various workstation and networking products. By September, code development could begin and the first orders for hardware went out. The first tests on site with SGI Power Series servers connected via UltraNet took place at the end of December 1990. A full production environment was in place by March 1991, the "go/no-go" date set by OPAL.
And then we hit a problem. The disk server system began crashing repeatedly with unexplained disk errors. Our design evaluations had led us to choose a "high-tech" approach: the use of symmetric multiprocessor machines from SGI, for both CPU and disk servers, connected by the sophisticated "UltraNet" Gigabit network backplane. One supporting argument had been that if the UltraNet failed or could not be made to work in time, then we could put all the SGI CPUs and disks together into one cabinet and ride out the OPAL storm. We hadn't expected any problems in a much more conventional area, the SCSI disk system.
Our disks were mounted in trays inside the disk server, connected via high performance SCSI channels. It looked "standard", but we had the latest models of everything. Like a Ferrari it was a marvel of precision but a pig to keep in tune. We tried everything but still it went on crashing and we finally had to ask SGI to send an engineer over from Mountain View. He found the problem: inside our disk trays was an extra metre of flat cable which had not been taken into account in our system configuration. We had exceeded the strict limit of 6 metres for single-ended SCSI, and in fact it was our own fault. Instead of SGI charging us penalties and putting the blame where it belonged, they lent us two extra CPUs to help us make up the lost computing time for OPAL and ensure success of the test period.
At the end of November 1991, a satisfied OPAL doubled their investments in CPU and disk capacity for SHIFT. At the same time, sixteen of the latest HP 9000/720 machines, each worth 10 CERN Units of CPU, arrived to form the first Central Simulation Facility or "Snake Farm" on account of the machine's code name within HP. The stage was set for the end of the big tidy mainframes at CERN and the much less elegant but evolving scene we see today on the floor of the CERN Computer Centre. SHIFT became the basis of LEP-era computing and its successor systems are set to perform even more demanding tasks for LHC, scaled this time to the size of a world-wide Grid.
About the author(s):
I graduated in Physics and Mathematics in 1958 from Imperial College London, then worked for 7 years on fast breeder reactor development, first for the UK Atomic Energy Authority and later in the USA for the Detroit Edison Company. I've been at CERN since 1971, after finishing my Ph.D. at Stanford University in Mechanical and Nuclear Engineering.
Except for a sabbatical in 1977, when I worked at Bell Northern Research in Palo Alto on a PABX development project (and encountered Unix for the first time), CERN has kept me pretty busy on various projects, including the coordinated introduction of the Internet Protocols at CERN beginning in 1985 (see: "A Short History of Internet Protocols at CERN"). For more details on my work, see my CV and some associated Notes, together with a partial List of Publications.