Bosonit TechXperience | "State of the Art of Data" by Francisco Javier Salvado
The second session of the TechXperience was hosted by our colleague Francisco Javier Salvado, Big Data Engineer at Bosonit, with the topic "Experiences in Big Data projects". In it he reviewed the basics of Big Data and its different applications, and then put them into context through his experience with clients. It was an enriching talk, and today we revisit it to collect its most interesting points and to look at the state of the art of data.
Who is Francisco Javier Salvado?
First of all, we must introduce the protagonist of the session: Francisco Javier Salvado. He was born in Chiclana, Cadiz, and has been teleworking from there since the beginning of the pandemic. He holds a degree in Telecommunication Technologies Engineering from the University of Seville and a Master's degree in Analysis and Visualisation of Massive Data from the UNIR.
The positions he has held are Big Data Engineer, Big Data Developer and, in his early days, Full Stack Developer. In terms of clients, his first experience was working for the Junta de Andalucía at Fujitsu, which you probably know for making air conditioners but which is also a fairly large technology consultancy. There he was part of a team within the Progreso y Salud Foundation, where he got started in web programming with Java.
His goal was to dedicate himself to the world of Big Data. When he had the opportunity, he joined Bosonit, first to train in Logroño and then to apply his knowledge and talent to client projects.
State of the Art of Data
Before delving into the state of the art of data, I am sure most of you have looked up what Big Data is at some point and have very probably come across the 'V's of Big Data, a series of 5 characteristics: Volume, Variety, Velocity, Veracity and Value. But the reality is that it can all be summed up in a single V: Volume. As the name itself indicates, if we are talking about Big Data we are talking about large volumes of data.
It is also worth noting that what seems big to us today will seem small in the future, just as what seems small today seemed much bigger in the past. To see this, it is enough to think about what a gigabyte means to us today compared to what it meant a decade ago. Therefore, we could conclude that Big Data is an ambiguous term that is still evolving. To give a real figure and to put into perspective the volume supported by one of these systems, I can say that in the project I am currently working on we have 100 GB of memory available in pre-production and 350 GB in the production environment. This memory is what Spark uses to do the work of processing the data.
With this in mind, if we talk about Big Data, we should also talk about distributed computing, as traditional systems are not able to perform the task with the agility needed to adapt to the demands of the moment.
But distributed computing should not be confused with parallel computing. In distributed computing, each processor has its own separate memory, while in parallel computing the memory is shared by all processors. This distinction is fundamental because, in distributed computing, when we work with Spark, a phenomenon called 'Shuffle' occurs: a transfer of information between the memories of the different nodes, which is essential to keep in mind when programming.
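To make the idea of a shuffle more concrete, here is a minimal sketch in plain Python. It does not use Spark and is not how Spark implements it internally; the node layout, the `shuffle_by_key` and `sum_by_key` helpers, the toy partitioner and the sample data are all assumptions made for illustration. It only shows why records must travel between nodes before values can be aggregated per key.

```python
from collections import defaultdict


def shuffle_by_key(partitions, num_nodes):
    """Redistribute (key, value) records so that equal keys end up on the same node.

    Returns the new partitions and how many records had to change node.
    """
    shuffled = [[] for _ in range(num_nodes)]
    moved = 0
    for source_node, records in enumerate(partitions):
        for key, value in records:
            # Toy deterministic partitioner (assumption): sum of the key's bytes.
            target_node = sum(key.encode("utf-8")) % num_nodes
            if target_node != source_node:
                moved += 1  # this record crosses the "network"
            shuffled[target_node].append((key, value))
    return shuffled, moved


def sum_by_key(partition):
    """Local aggregation: runs on one node, using only that node's memory."""
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)


# Each inner list plays the role of one node's private memory.
partitions = [
    [("madrid", 3), ("sevilla", 1)],
    [("madrid", 2), ("cadiz", 5)],
    [("sevilla", 4), ("cadiz", 1)],
]

shuffled, moved = shuffle_by_key(partitions, num_nodes=3)

# After the shuffle every key lives on exactly one node, so local sums are final.
totals = {}
for partition in shuffled:
    totals.update(sum_by_key(partition))
```

The `moved` counter is the point of the sketch: every record that changes node stands for data crossing the network, which is why operations that trigger a shuffle tend to be the expensive ones.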
Working environments for Big Data management
If we look back in time, the World Wide Web was developed in the 1990s, Google appeared in the late 1990s and Facebook was founded in 2004. Without realising it, it could be said that we were inventing the car before the wheel, as the exponential increase in the number of internet users made it necessary to develop technologies that would provide efficient solutions for managing such a voluminous amount of data.
2006 saw the launch of Apache Hadoop, a distributed computing framework that we are all familiar with today and which was inspired by Google's MapReduce and Google File System. Eight years later, in 2014, Apache Spark was born. Its arrival meant a radical leap in quality, with performance gains of up to a hundredfold on some workloads, among other things because it works in memory instead of on disk.
In connection with this, it is no coincidence that 2013 saw the publication of a report on the concept of Industry 4.0, or the fourth industrial revolution, which is based, among other disruptive technologies, on Big Data.
Within the Apache Hadoop project, and linking in with what was said earlier about the creation of Facebook and the rise of social networks, it is worth mentioning HBase, for example. It is a column-oriented, key-value database launched in 2006 and modelled on Google BigTable. In 2010, Facebook made use of HBase to implement new messaging services.
For my part, I have used HBase in banking projects whose requirements demanded random updates of individual records. For this purpose HBase is much more efficient than the typical Hive table in Parquet format over HDFS.
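The difference in update cost can be pictured with a deliberately simplified toy model in plain Python. Neither class below uses the real HBase or Parquet APIs; `KeyValueStore`, `ImmutableFileTable` and the `records_written` counter are hypothetical names, and the model ignores partitioning, compaction and everything else a real system does. It assumes only that a key-value store can address one record by its key, while a table stored as write-once files has to rewrite a file to change one record in it.

```python
class KeyValueStore:
    """Toy stand-in for a row-key-addressed store: an update touches one record."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.records_written = 0

    def update(self, key, value):
        self.rows[key] = value
        self.records_written += 1


class ImmutableFileTable:
    """Toy stand-in for a table backed by a write-once file: an update rewrites it."""

    def __init__(self, rows):
        self.file = tuple(rows)  # a tuple cannot be modified in place
        self.records_written = 0

    def update(self, key, value):
        # Build a whole new "file", replacing only the matching record.
        new_file = tuple((k, value if k == key else v) for k, v in self.file)
        self.records_written += len(new_file)
        self.file = new_file


# Hypothetical sample data: 1000 accounts with the same starting balance.
accounts = [(f"acc-{i}", 100) for i in range(1000)]

kv = KeyValueStore(accounts)
table = ImmutableFileTable(accounts)

kv.update("acc-42", 250)
table.update("acc-42", 250)
```

In this model a single random update costs one record written in the key-value store and a thousand in the file-backed table, which gives an intuition for the efficiency gap described above.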
Programming languages supported by Spark
Spark supports Scala, Python, Java and R as programming languages. Of these four, I will talk a little about Scala because, besides being the language I work with, I would say it is the most characteristic, since Spark itself is mainly written in it.
Scala is a multi-paradigm programming language. It supports both object-oriented and functional programming features, so we can take the best of both worlds for our applications. The main features of functional programming are as follows:
- Higher order functions.
- Pure functions.
- Non-strict evaluation.
- Use of recursion.
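The four features listed above can be sketched in a few lines. The example is written in Python (another of the languages Spark supports) rather than Scala, and all function names are made up for illustration; Python generators are used here as an approximation of non-strict evaluation, not as an exact equivalent of Scala's lazy constructs.

```python
from itertools import count, islice


def apply_twice(f, x):
    """Higher-order function: it receives another function as an argument."""
    return f(f(x))


def normalize(name):
    """Pure function: same input, same output, and no side effects."""
    return name.strip().lower()


def squares():
    """Non-strict evaluation: an infinite sequence computed only on demand."""
    return (n * n for n in count(1))


def factorial(n):
    """Recursion: the function is defined in terms of itself."""
    return 1 if n <= 1 else n * factorial(n - 1)


doubled_twice = apply_twice(lambda x: x * 2, 5)
clean_name = normalize("  Cadiz ")
first_squares = list(islice(squares(), 5))  # only five values are ever computed
five_factorial = factorial(5)
```

Because `normalize` depends only on its argument, testing it requires nothing more than calling it and comparing the result, with no environment to prepare.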
Among these four, I would highlight the use of pure functions (functions that always give the same output for the same input), which make our code cleaner and easier to test.
Finally, looking to the future, we are already making the jump from on-premise to the cloud. My view is that we will have hybrid architectures with a clear tendency towards microservices and on-demand serverless.