PyData Warsaw 2017 – thoughts & impressions

Sunscrapers Team

7 November 2017, 5 min read

A couple of weeks ago, a few members of our team had the pleasure of participating in PyData Warsaw 2017, held at the Copernicus Science Center in Warsaw on October 19-20, 2017.

PyData conferences bring together users and developers of data analysis tools to meet, share ideas, and learn from one another. This was the first edition of the event, and it was already a huge success! The community plans to gather every year to discuss applications of Python tools (plus tools using R and Julia) and to tackle challenges in data management, processing, analytics, and visualization.

Here’s what this year’s PyData conference looked like.

Upon Arrival at PyData Warsaw 2017

PyData attracted over 300 participants, and it’s clear that the turnout was much higher than the organizers expected - which is a great thing! The location was excellent. The Copernicus Science Center is located in the central part of the city, but somehow you don’t feel that annoying city buzz - many people we talked to really liked the venue.

A minor organizational glitch was the division of the auditorium space. The conference was divided into 3 streams, and 2 of them would always get only something like 1/4 of the conference room. As a result, it was often hard for participants to find a seat in these streams.

Keynote Lectures

The first keynote speaker was Jarek Kuśmierek, a Senior Engineering Manager at Google. He talked about the current revolutions in data science, especially in machine learning. He used many examples to show how Google applies machine learning - for example, they applied it to the network that operates the ventilation system at their data centers, which yielded 40% savings in energy use.

He also presented Google’s API for machine learning. With it, even a developer who has never worked with machine learning before can create an application that recognizes human speech or automatically classifies and tags images. All in all, it was hard not to agree with Jarek - we are right in the middle of this revolution.

The second keynote lecture was by Radim Řehůřek, the creator of Gensim and a machine learning consultant. Radim’s fascinating talk was about interpretable data models. As machine learning algorithms become more popular and advanced, we are beginning to lose our understanding and control of them. We increasingly treat these algorithms like black boxes: we just feed them data and they give us a result, without us knowing exactly how the program arrived at that conclusion. A potential solution to that problem lies in building and using tools that help developers understand how a given neural network works. We should also try to be more responsible about using machine learning and stop feeding algorithms with vast amounts of data without a second thought.

Radim said we should try to come up with the result on our own first. He suggested that we tend to trust machine learning a little too much today - and we couldn’t agree more.

Other Interesting Talks

In general, we saw a lot of talks about Natural Language Processing (NLP), and it's clear that there is much work still to be done in this area. We're talking primarily about the Slavic language family, which includes fusional languages – unlike English, languages like Polish or Russian have more complex rules and contain many ambiguities. They also have a much smaller community around them and are hard to process.

We learned about tools that could be useful in our projects at Sunscrapers: word2vec, GloVe, and fastText, to name just a few.

Here are 3 talks we found particularly interesting:

Szymon Warda offered an interesting review of alternatives to databases and told us which types of databases are most useful for specific applications. Fun fact: Apache Accumulo is a database that was created for security purposes by none other than the NSA...

Another interesting talk was given by Kornel Lewandowski, who looked at personal data security in medical documentation. Analyzing medical documentation can be very productive - however, these documents often contain plenty of personal data that needs to be removed before analysis. Kornel showed us various techniques for identifying personal data, for example regular expressions, dictionaries, rule-based methods, and machine-learning-based named entity recognizers. We got a view of the entire workflow for this kind of task, as well as the architecture.
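To give a feel for the two simplest techniques on that list, here is a minimal sketch of regex- and dictionary-based redaction. It is not Kornel's implementation: the patterns, the name list, and the `redact` helper are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical patterns and name list, chosen only for illustration.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b\d{3}[- ]?\d{3}[- ]?\d{3}\b")  # 9-digit phone numbers
KNOWN_NAMES = {"Jan Kowalski", "Anna Nowak"}  # stand-in for a real name dictionary


def redact(text: str) -> str:
    """Replace pattern matches and dictionary entries with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    # Longest names first, so a shorter entry never splits a longer one.
    for name in sorted(KNOWN_NAMES, key=len, reverse=True):
        text = re.sub(re.escape(name), "[NAME]", text)
    return text


note = "Patient Jan Kowalski, tel. 600 123 456, mail jan.k@example.com, reports headaches."
print(redact(note))
```

A real pipeline would need far more than this. In a fusional language like Polish, names change form depending on grammatical case, which is one reason dictionaries and regular expressions are usually combined with rule-based methods and named entity recognizers.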

The last talk that made a great impression on us was “Despicable machines: how computers can be assholes” by Maciek Gryka. As you can tell from the title, the talk was dedicated to a phenomenon known as machine bias. Machine learning algorithms that analyze data about humans can easily learn behaviors that their developers never intended. For example, an algorithm that calculates the likelihood of a person committing a future crime learned to take into account factors like skin color or facial expression. We don’t have an easy solution for this controversial issue yet, though interpretability might be one.

To put it simply, models that serve to describe and judge humans need to be understandable to us. We should know exactly how they work and why they deliver specific results. With that knowledge, we will be able to modify these models to avoid machine bias. You can see Maciek’s talk on the topic here.
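As a rough illustration of what "knowing why a model delivers a specific result" can mean, here is a minimal sketch of a linear score broken down into per-feature contributions. It is our own toy example, not something shown in the talk: the weights, the feature names, and the `explain` helper are all made up.

```python
# Hypothetical weights of a linear risk score; all names and numbers are made up.
WEIGHTS = {"prior_offences": 0.8, "age_under_25": 0.3, "neighbourhood_code": 0.5}


def explain(features: dict) -> dict:
    """Return each feature's contribution (weight * value) to the total score."""
    return {name: WEIGHTS[name] * value for name, value in features.items()}


person = {"prior_offences": 1, "age_under_25": 1, "neighbourhood_code": 2}
contributions = explain(person)
score = sum(contributions.values())

# Listing contributions from largest to smallest shows what drives the result.
for name, part in sorted(contributions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {part:+.2f}")
print(f"total score: {score:.2f}")
```

In this toy case the breakdown shows that `neighbourhood_code` drives the score more than anything else. If such a feature turned out to act as a proxy for a sensitive attribute, that would be the place to start fixing the model. Black-box models don't offer this kind of breakdown directly, which is exactly why interpretability tools matter.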

Naturally, there were many more interesting presentations, and we wish we could talk about them all here. To get an idea, have a look at the schedule to see short descriptions and abstracts of all talks.

PyData Warsaw 2017 was packed with inspiring talks that showed us all some pretty smart solutions and potential directions for the future. We wish to thank the organizers for making that happen - it’s great to be part of the PyData community!
