Scalability, do we care?


Scalability is a property that machine learning practitioners often hope to achieve. The dream is that their techniques could solve not only the easy problem instances but also the more challenging ones. The No-Free-Lunch theorem suggests otherwise, and much research aims at advancing techniques in this direction. My contribution demonstrates that reusing some concepts widely used in finance can guide an automated design algorithm to identify scalable metaheuristics for some real-life NP-hard, discrete problems.

Scalability, should we really care?

Is it worth considering in the design of any computer system? In my opinion, the founders of computer science may well have been aware of such issues; their designs are still very efficient. Historically, scalability has often been achieved adequately, even if it was not labelled as such.

For example, the Unix operating system had a very humble start; its versatile and simple design made it the preferred choice for good performance. It is true that several rewrites were required, but these refinements mean that Unix can be installed on many computer platforms with only a small amount of machine-dependent code. It is now hidden in our phones, smart watches, tablets and other computers, both large and minute.

Are you surprised? Consider that many current operating systems have Unix at their core, including Android, iOS and macOS. More importantly, Unix-family operating systems often provide a solid backbone to many organizations, some of them of considerable size. Without Unix, computer clusters could not provide the tools used to study genomes. Finally, Unix was at the heart of ARPANET, the network that is nowadays referred to as the Internet.

The Internet is based on ideas first suggested by Alan Turing at Bletchley. Each element of a network has a unique address that indicates the networks and sub-networks in which it lies, and any two elements can exchange information by finding a route linking them together. With these ideas in mind, the Internet Protocol was developed during the Cold War to share military information between three cities in the United States. Academics took the opportunity to exchange information with each other instead. Once IP was coupled with the Transmission Control Protocol, many hardware and software manufacturers started collaborating, and better cohesion between networked computers emerged. More recently, TCP/IP had to be extended because of its own success.
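To make this hierarchical addressing concrete, here is a minimal sketch in Python using only the standard-library ipaddress module; the addresses and prefix length are arbitrary examples chosen for illustration, not anything from the original designs.

```python
import ipaddress

# An arbitrary example sub-network and one host address inside it.
network = ipaddress.ip_network("192.168.10.0/24")
host = ipaddress.ip_address("192.168.10.42")

# The /24 prefix says which part of the address names the sub-network
# and which part names the individual element.
print(host in network)                    # True: this element lies in that sub-network
print(network.network_address)            # 192.168.10.0
print(network.num_addresses)              # 256 addresses available here
print(list(network.subnets(new_prefix=26))[:2])  # smaller sub-networks carved out of it
```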

The World Wide Web, too, had a very humble start. Tim Berners-Lee developed a simple data formatting language to display information attractively, and then a simple protocol that mimics a discussion between a server and a browser. HTML and HTTP first shared information within a small team of scientists; nowadays they let us shop online and interact with other human beings across the globe. Social media, online shopping, e-business and e-commerce would not exist without the WWW.
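As a rough illustration of that discussion between a browser and a server, here is a minimal sketch using Python's standard-library http.client; example.com is only a placeholder host, and a real browser of course does far more.

```python
import http.client

# Open a connection to a web server (example.com is a placeholder host).
conn = http.client.HTTPConnection("example.com", 80)

# The browser's side of the conversation: ask for a resource.
conn.request("GET", "/")

# The server's side: a status line, some headers, and the HTML body.
response = conn.getresponse()
print(response.status, response.reason)        # e.g. 200 OK
print(response.getheader("Content-Type"))      # e.g. text/html; charset=UTF-8
html = response.read().decode(errors="replace")
print(html[:200])                              # the first few characters of the page

conn.close()
```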

Let’s program

Some programming languages have also demonstrated an ability to scale. The first that comes to my mind is C. This well-known language was used to write Unix, but also much smaller systems such as satellites, radio systems and basic electronic circuits. It remains mainly a procedural programming language, yet the versatility of its compilers and assemblers lets it adapt to a wide range of computing platforms. Pascal was integrated into Delphi in the mid-1990s and, in some parts of the world, became the preferred language for developing database systems.

It would be wrong not to mention C++. This programming language is continually evolved by companies such as Morgan Stanley, Google and others. Its creator, professor Bjarne Stroustrup, describes it as an elephant that every C++ programmer perceives in their own way. And yet it is highly successful for developing applications across a wide range of backgrounds. However, biologists and other scientists may not persevere past the first hurdles in the early stages of learning it.

Java is a younger programming language that can produce highly interoperable programs: the Java Virtual Machine interprets the code, making it usable on many platforms. Python has become the preferred programming language of biologists; many bioinformatics systems rely on it to find patterns in DNA and RNA sequences that are measured in gigabytes. The simplicity of the language suits programmers who do not wish to worry about pointers and memory allocation. Finally, R, a free programming language, can process some mighty statistics.
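As a small illustration of the kind of task mentioned above, here is a minimal sketch of finding a short motif in a DNA sequence with plain Python; the sequence and motif are invented for the example, and real bioinformatics pipelines would typically rely on specialised libraries such as Biopython and on inputs far larger than this.

```python
import re

# Invented example data; real inputs would come from files measured in gigabytes.
dna = "ATGCGTACGTTAGGCTAGCATCGATCGTACGATCGTAGCTAGC"
motif = "TACG"

# Report every position at which the motif occurs (overlapping matches included).
positions = [m.start() for m in re.finditer(f"(?={motif})", dna)]
print(positions)
```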

Databases, surely those must be scalable systems

DBMS have been designed to work with anything from a few records to millions. SQL can manipulate a lot of information with a simple expression, and some statistics can be computed very easily. The beauty of these systems lies in applying set theory to structured data. Indexes and the normal forms help retrieve information very quickly, from many platforms.
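To illustrate how a single declarative expression can compute a statistic over structured data, here is a minimal sketch using Python's built-in sqlite3 module; the table and figures are invented for the example.

```python
import sqlite3

# An in-memory database with an invented table of orders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 10.0), ("bob", 25.5), ("alice", 4.5), ("carol", 7.0)],
)

# An index on the grouped column helps the engine retrieve rows quickly.
conn.execute("CREATE INDEX idx_customer ON orders(customer)")

# One declarative expression: total spend per customer, largest first.
query = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
"""
for customer, total in conn.execute(query):
    print(customer, total)

conn.close()
```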

This simple idea is not really fancy, nor exciting, for many information technologists. As a result, we have attempted to bring these systems, designed for large-scale computing, to our personal computers; Microsoft Access, Derby and MySQL appeared from this movement. While MySQL has demonstrated that it can store the data of some of the biggest retailers on the planet (e.g. Amazon), the other two appear to struggle to scale as well to larger applications and multi-user connections.

It is a real shame that such a good idea has not really been improved by layers such as Hibernate, which try to bring an object-oriented approach to the underlying set theory. While this may work well for small sets of data, larger databases may struggle, and performance can suffer dramatically. I am curious to see whether true object-oriented database management systems will be widely adopted in the future.

So what makes a computer system scalable?

Many of these examples can be expressed simply, which lowers the barrier to comprehension. As a result, many people have read about them and can reproduce these systems for their own purposes. Consortiums then adapt the systems to meet new demands and disseminate the information. Corporations may prevent this healthy exchange of information to protect their assets, which is understandable.

Many of these examples run on an extensive variety of hardware, yet their open-source nature has permitted quite impressive interoperability. The costs may have shifted towards technological expertise rather than licensing. Some organizations can adapt the systems to their own needs; others rewrite them as commercial products to make them safer, which requires quite a bit of legal negotiation.

These systems can handle suitably small tasks and more demanding ones too. Their rudimentary architecture has led them to be used by very small organizations and by a large proportion of computer users. Fewer elements can break, or they can be combined in such a way that they become complex systems themselves. Their high level of specificity should make it easier to design objects or services.

Also, these systems are well documented. Their transparency contributes to their wide adoption and adaptation to new contexts.

It is time we, computer scientists, started designing techniques and computer systems with well-defined, smaller architectures again. It then becomes easier to communicate their concepts to our peers and to use such systems at a much larger scale without negatively affecting their architecture. Transparency is good!
