Do I need a database?

Teemu Maatta
6 min read · Sep 10, 2019


Databases, such as SQL databases, are wonderful for queries and reporting needs, but it is not always easy to decide whether a new database is required.

Photo by Markus Spiske on Unsplash

Benefits of a database

We could store data directly in an Excel spreadsheet, so let’s first consider why we would need a database at all. Several benefits are commonly listed:

  • Centralized location for all employees: easy access, fast queries, multiple users, flexibility, fewer errors (see article)
  • Turning disparate information into valuable resources: speed, quality and consistency of information (see article)

An example: within a single business process, an enterprise may produce tens of thousands of new records per day. Besides the sheer number of rows, there are likely tens or even hundreds of columns to track each type of data. The data is then gathered daily for months or even years, so the record count quickly reaches the millions. A small business may be able to manage this data in Excel spreadsheets, where errors are likely to occur at any moment. However, as businesses grow, the data must be consistent across users and over time. Just imagine how slow it would be to manage all this data from a spreadsheet.

Databases are not equal

There is a great website that tracks the popularity of the different types of databases. The ranking is interesting for following the competition across cloud providers, and also for seeing which types of databases are available around the world. Relational databases, e.g. PostgreSQL, remain very popular. However, it has also become common to see other types of databases, such as MongoDB. Graph databases in particular are growing in popularity (see link).

An example: many of us have become very used to running SQL queries against a relational database. The basics of the SQL language are very simple to learn. We can run a query quickly by defining the table we want to look up, the columns we want the report to include and the filters we want to apply to limit the data. We can even query across several tables in the database, which may hold millions of records, within seconds. Although we can modify the queries, the data structure of each table remains constant. If our reporting needs change, we have quite limited capability to modify the existing relational database. This is where NoSQL databases become useful (see details on link). Instead of always having the same attributes on our data, such as company name, address and country, we might store images, each with its own particular metadata.
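The query shapes described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the `companies` and `orders` tables, their columns and the sample rows are hypothetical and do not come from any real project.

```python
import sqlite3

# Hypothetical schema and rows, used only to illustrate the query shapes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, company_id INTEGER, amount REAL);
""")
conn.executemany(
    "INSERT INTO companies (id, name, country) VALUES (?, ?, ?)",
    [(1, "Acme", "Spain"), (2, "Globex", "Finland"), (3, "Initech", "Spain")],
)
conn.executemany(
    "INSERT INTO orders (company_id, amount) VALUES (?, ?)",
    [(1, 100.0), (1, 250.0), (2, 75.0), (3, 40.0)],
)

# Table, columns and filter: the three parts of a basic query.
spanish_companies = conn.execute(
    "SELECT name, country FROM companies WHERE country = ? ORDER BY name",
    ("Spain",),
).fetchall()

# A query across two tables, joined on the company id.
totals_by_company = conn.execute(
    """
    SELECT c.name, SUM(o.amount)
    FROM companies AS c
    JOIN orders AS o ON o.company_id = c.id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()

print(spanish_companies)
print(totals_by_company)
conn.close()
```

Note that the table structure is fixed by the CREATE TABLE statements, while the queries can change freely. That is the constraint the paragraph above refers to.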

Learn from mistakes

I have used Excel for decades and SQL for around 10 years now, but I must confess I learned the hard way that a data science project is not just about building the database.

Real-life example: I was very excited to start a data science project that had been assigned to me and a colleague. I was eager to help on the project and I knew my skills were valuable. I realized immediately that the deadline was tight and that we would need to be very efficient to deliver a pilot product on time. I was very familiar with the data used and I knew it would be valuable to us. As we defined the next actions, we somehow got the idea that we had to build a database. Our reasoning was based on several ideas, which all sounded plausible:

  • The majority of our data was already in a database, but we had to enhance it with external data sources
  • Our pilot product would manage around a million rows of data and eventually tens of millions
  • Data had to be managed via special access
  • Data had to be centralized and available 24 hours a day

All this led us to believe that we could not really do anything other than build the database from scratch, so we started the process of choosing the right database product. This process turned out to be a learning curve, as we compared products, learned how to set them up, and reviewed the confidentiality and data privacy options. Adding new records turned out to be rather slow, as we had a large set of columns. Still, I was most concerned about database maintenance. From past projects I had learned that maintenance work could jeopardize our future product releases, as our resources would be committed to updating the database with new columns of data. I was also concerned that after the release we would be switching to another database, again with its own specifics to learn.

After spending precious time on setting up a new database, which I knew still required a lot of work, I decided it was best to ask for help and spend a bit of time considering the available options for moving ahead. It turned out to be the right decision.

I realized quickly that the SQL database design was consuming so much effort that I would end up building my data science career around databases rather than around the actual product and its business outcomes. My fellow data scientists helped me to see that I was not limited to SQL databases. I realized that a CSV file could work in the short term, which made a lot of sense from a learning perspective, as I would now be focusing my time on building the machine learning model rather than on the technicalities of the database. I realized that my technical solution might be even better if I used object storage for loading the Excel files instead of any sort of relational database. Another tip I got was to create the pilot product using Python and SQLite instead.
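As a rough illustration of the short-term CSV route, the sketch below reads rows with Python's standard csv module and answers a simple question without any database. The column names, the sample rows and the `records.csv` file name mentioned in the comment are all hypothetical.

```python
import csv
import io

# Hypothetical CSV content; in practice this would be a file on disk,
# opened with open("records.csv", newline="").
csv_text = """customer,country,amount
Acme,Spain,100.0
Globex,Finland,75.0
Initech,Spain,40.0
"""

def load_rows(handle):
    """Read CSV rows into a list of dicts, converting amount to float."""
    rows = []
    for row in csv.DictReader(handle):
        row["amount"] = float(row["amount"])
        rows.append(row)
    return rows

rows = load_rows(io.StringIO(csv_text))

# A simple "query": filter rows and aggregate, with no database involved.
spain_total = sum(r["amount"] for r in rows if r["country"] == "Spain")
print(len(rows), spain_total)
```

For a pilot with around a million rows this kind of approach may still be workable, although the whole file is read into memory on each run.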

SQLite package

I had heard people talk about SQLite a few times, with a lot of differing opinions, so I had kept it on my list of topics to learn, but I had not spent any time on it. At the same time, I was learning Flask and came across a guide where Flask was used together with SQLite, which immediately caught my interest.

In fact, it took just a few simple lines of code to get the SQLite database set up and my data stored as an SQL database. I had a working demo in minutes, and I realized that the simplicity of the Python code allowed me to get my database fully set up within hours rather than weeks. It also made perfect sense for my career, as I wanted to deepen my Python skills, especially around Flask, which my mentor recommended.
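To give a sense of how few lines are involved, here is a minimal sketch using the sqlite3 module from the Python standard library. It is not the code from the original project: the `pilot.db` file name, the `records` table and the sample rows are assumptions made for illustration.

```python
import os
import sqlite3
import tempfile

# Hypothetical file name, table and rows, for illustration only.
db_path = os.path.join(tempfile.mkdtemp(), "pilot.db")

records = [
    ("Acme", "Spain", 100.0),
    ("Globex", "Finland", 75.0),
]

# connect() creates the database file if it does not exist yet.
conn = sqlite3.connect(db_path)
with conn:  # commits on success, rolls back on an exception
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (customer TEXT, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO records VALUES (?, ?, ?)", records)

row_count = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
print(row_count)
conn.close()
```

The whole database lives in a single file, so there is no server to install or maintain, which is what makes this route attractive for a pilot.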

If you are interested in SQLite, I recommend checking out the many online resources available. I also wrote a blog entry about it.

Finish the product

Business transformation is really fast, and agile methodologies have become the standard on many projects, some of them touching critical business applications. However, a lot of business processes are very restricted, and organizations tend to limit the way technology can be accessed and used, so projects have complex constraints to overcome. To get managerial buy-in, we need to put products in front of users so they can see and touch them, and so we can convince management to invest more resources in developing them further.

It is essential to get the product launched. In our case, to achieve this, we had to learn from mistakes, focus on value-adding activities and find the right technology. A good project plan takes into account the technology that best fits the project.

Conclusions

I started a project thinking I really needed a database, and in the end we got a database running. However, I was lucky, as I got the database running in a simple way that I had not been aware of. I came close to having my project cancelled, as I was about to miss the project schedule. So my advice is to think the product through clearly, so that you know you can manage its delivery, and to educate yourself on new technology if needed.

As a data scientist, it is easy to get overwhelmed by all the new technologies to learn, so we need to be careful to spend our time wisely.

If you now ask me, “Do I need a database?”, I will say: it depends on the type of product. But certainly try Python packages like SQLite first, familiarize yourself with the various database technologies, and try to keep your work around Python, because most likely you will be able to manage most of your database needs directly with Python. A real database most likely starts to make more sense once your product is so good that you could hire a person just for its maintenance. If you think some of the maintenance will be left to you, you will surely want to avoid building any extra burden to maintain.

