Tools and languages for data science
Tom Davenport lists his favourite tools and languages for data science.
There are loads of tools and languages that a data scientist can use nowadays. Here are a few of the ones that I turn to…
Languages for data science
According to a poll by KDnuggets, Python and R are currently the languages that data scientists use most often. Both are free to download and install on your computer. They are what are known as open source languages: they are created and developed by a community and are free to use in your own systems.
It’s generally regarded that the best way of installing Python on your computer if you plan to do data science is to install Anaconda. This installation takes up quite a bit of space, but it will include almost all the packages you need for working in data science.
For anything that it doesn’t include, the distribution installs a very powerful package manager called conda, along with the Python package manager pip, which is (or at least used to be) notoriously difficult to install on Windows.
I’ve found Anaconda pairs nicely with PyCharm, which is a really powerful Python IDE. An IDE (Integrated Development Environment) helps you format code nicely and highlights where you may have made mistakes.
Plot.ly is a nice visualisation library that is usually used online with their platform, but the offline version is free, which is handy if you are generating graphs for reports.
RStudio remains the best way to use R. I also really like R Shiny for working with data interactively. Shiny apps can be deployed online to websites or used to make interactive dashboards.
The Rattle GUI also provides a great entryway to machine learning.
Tools for machine learning
A good introduction to data science is data mining, a really interesting aspect of data science which uses machine learning. This is where you train a computer to classify or predict an outcome, or to group similar items within your data. These data mining models are behind many of the outputs of big data that make the field so interesting. There are many freely downloadable machine learning GUIs (Graphical User Interfaces) that allow you to start making these models without needing to code. You still need to understand what is happening, but once you have reviewed the theory the tools become a bit easier to use.
Weka is also a really great visual tool for learning machine learning. Their book “Data Mining: Practical Machine Learning Tools and Techniques” is a great introduction to the tool and one of the best books on data mining available, in my opinion.
RapidMiner is another visual GUI tool, and comes with a series of really informative and interactive tutorials. There’s a free tier, but it will only process a certain number of rows for free, after which you’ll feel pushed towards the enterprise product.
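If you would like to see what "training a computer to classify" means before opening one of these GUIs, here is a minimal sketch in plain Python. It is not how Weka or RapidMiner work internally; it uses one very simple technique (a nearest-centroid classifier), and the data, labels and function names are all made up for illustration.

```python
from statistics import mean

# Made-up training data: (height_cm, weight_kg) -> label. Purely illustrative.
training = [
    ((150.0, 50.0), "small"),
    ((155.0, 55.0), "small"),
    ((160.0, 58.0), "small"),
    ((180.0, 85.0), "large"),
    ((185.0, 90.0), "large"),
    ((190.0, 95.0), "large"),
]


def train(examples):
    """Compute the average point (centroid) of the examples for each label."""
    grouped = {}
    for features, label in examples:
        grouped.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*points))
        for label, points in grouped.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the given point."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(
        centroids,
        key=lambda label: squared_distance(centroids[label], features),
    )


# "Training" here is just summarising each group by its average.
centroids = train(training)

# Classify two new, unseen points.
prediction_small = predict(centroids, (158.0, 54.0))
prediction_large = predict(centroids, (182.0, 88.0))
print(prediction_small, prediction_large)
```

The GUI tools hide these steps behind menus, but the idea is the same: learn something from labelled examples, then use it to label new ones.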
Tools for data visualisation
There’s also a multitude of tools that allow you to visualise your data effectively and create charts really quickly. They are great for creating presentations and dashboards and sharing them with others. They’re steadily getting smarter too – many are beginning to build machine learning into their tools.
Tableau is an enterprise tool that is currently the leader for visualising data and analytics according to Gartner. It’s quick, it scales pretty well, and the design of the tool is very good. There’s a free version for students, and they recently reviewed their pricing structure, making it more affordable for everyone else.
Power BI is Microsoft’s flagship visualisation tool and is the other tool that dominates the market according to the Gartner report. It’s pretty comprehensive and powerful, but it’s missing some features that you would expect it to have already, unfortunately including things like formatting dashboards, sharing and exporting. The pricing model is under review, and some features that would make it very useful have been pushed to their premium tool. There’s a free version available, and the “pro” tier between the free version and the premium version is included in many organisations’ Office 365 accounts. You can get pretty far with the free and “pro” versions, however, and every month they push out a raft of new features.
Google Data Studio
The free version of Google’s new Data Studio works quite well for analytics datasets, and it’s really good if you have access to a lot of Google tools (such as BigQuery, Google Analytics 360 or Sheets).
Databases for data science
I’d be remiss not to include some resources on how to store big data. You can either install a SQL or NoSQL database on your own hardware, or use a database on a cloud computing platform.
MySQL, Postgres and Oracle are great for high volume datasets where there is a defined order. These mostly use the Structured Query Language (SQL), which is very useful to learn and is a skill many employers seek. They handle volume really well, enforce order and can scale well.
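To give a flavour of SQL and of what "enforcing order" can look like in practice, here is a small sketch using SQLite, which ships with Python, as a stand-in for the larger databases named above. The table, column names and rows are hypothetical, and I am reading "order" as a schema declared up front, which is an assumption on my part.

```python
import sqlite3

# An in-memory SQLite database stands in for MySQL, Postgres or Oracle here.
connection = sqlite3.connect(":memory:")

# The schema is declared up front: every row must fit these columns.
connection.execute(
    "CREATE TABLE sales ("
    "id INTEGER PRIMARY KEY, "
    "region TEXT NOT NULL, "
    "amount REAL NOT NULL)"
)

# Hypothetical rows, for illustration only.
connection.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0)],
)

# A typical SQL query: total sales per region, largest first.
totals = connection.execute(
    "SELECT region, SUM(amount) FROM sales "
    "GROUP BY region ORDER BY SUM(amount) DESC"
).fetchall()
print(totals)

# The schema is enforced: a row missing a required value is rejected.
try:
    connection.execute(
        "INSERT INTO sales (region, amount) VALUES (?, ?)", ("east", None)
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)

connection.close()
```

The same SELECT, GROUP BY and ORDER BY ideas carry over to the bigger relational databases, although each has its own dialect and extras.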
Having a large variety of data sources tends to be better served by NoSQL tools such as (and certainly not limited to) MongoDB or Cassandra. These tools can often handle high volumes well, and can also be useful when the data does not always have a defined order, or when the speed at which it arrives or is deleted would overwhelm a traditional relational database. These databases are increasingly used within large organisations for specific purposes and are often used by startups.
These databases are increasingly moving to the cloud. This is because storage space in the cloud is relatively cheap and organisations don’t have to buy new equipment if they need more space: they just rent more power. There are free tiers available for practice too; I particularly recommend the mLab free tier.
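As a rough illustration of data that "does not always have a defined order", the sketch below stores records with different fields side by side, the way a document-style store does. It uses only Python’s json module; it is not the MongoDB or Cassandra API, and the records and the find helper are invented for the example.

```python
import json

# Hypothetical records from different sources; each has its own set of fields.
documents = [
    {"user": "ana", "email": "ana@example.com"},
    {"user": "ben", "tags": ["r", "python"], "last_login": "2017-05-01"},
    {"user": "cy", "profile": {"city": "Leeds", "languages": ["sql"]}},
]

# Each record is kept as a self-describing document, serialised here as JSON.
# No table definition is needed before storing a record with new fields.
stored = [json.dumps(doc) for doc in documents]


def find(collection, field):
    """Return the users of all documents that contain the given top-level field."""
    results = []
    for raw in collection:
        doc = json.loads(raw)
        if field in doc:
            results.append(doc["user"])
    return results


users_with_tags = find(stored, "tags")
users_with_email = find(stored, "email")
print(users_with_tags, users_with_email)
```

The trade-off is that the database no longer guarantees every record has the same fields, so the code reading the data has to cope with whatever turns up.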
Amazon Web Services
AWS is the go-to cloud computing platform for many developers. It has a huge market share in this area, and a lot of developers have been using it for a while now. It allows companies to scale up their storage or computing capacity very quickly and cost efficiently. The free tier and pricing work much better than those of some other providers. It’s designed for developers, and whilst there is a great community out there, the support from Amazon can be a bit limited and there aren’t many code examples. There’s quite a learning curve involved in using it effectively, so training would be necessary.
Microsoft Azure is a growing cloud computing platform for developers. There’s a large number of systems that you can set up with it: you can create websites, databases and virtual machines in a few clicks. They are always improving and adding new services, although it can sometimes be a bit of a wait for services that you think it should have already. The pricing system is a bit irritating too – although it’s certainly capable of scaling from small instances to huge instances, the free development tiers can be very limiting. It often feels like it’s designed only for large organisations. The support systems are great though, and the documentation is usually pretty comprehensive.
Google Cloud Platform
Google comes in third in the list of the most used cloud platforms. There aren’t as many connections to third party applications available as there are in AWS and Azure. I haven’t really tried this one, but they do have all the components there to make a useful platform.
Teach yourself data science
If you are interested in teaching yourself how to use these tools and languages, check out my blog post on How to Learn Data Science.