NLPx

Tales of Data Science

Conditional Random Fields (CRF): Short Survey

In the picture above, you can see a random field.

These days, many of us are overwhelmed by the mighty power of Deep Learning. We are starting to forget about humble graphical models. CRF is not as trendy as LSTM, but it is robust, reliable and worth noting.

In this post, you will find a short summary of CRF (Conditional Random Fields): what it is, what it is for, and some interesting facts. Enjoy!

Examples of Data Analysis Reports

Introduction

Recently, I discovered some old examples of data analyses that my colleagues and I carried out for study purposes in 2013, during the Data Analysis course on Coursera. These examples are based on analyses of two datasets: the Lending Club dataset and the Samsung smartphones dataset. The examples DO NOT contain advanced approaches to Data Analysis and Data Mining, but they will come in handy for anyone who needs to see what a decent data analysis report should look like.

But remember: the following data analysis reports were written for readers who are at least acquainted with standard approaches to data analysis and predictive modeling.

A business-driven data analysis for non-technical people (such as managers) should be written differently:

  1. with far fewer technical details, or none if possible,
  2. with a thorough yet simple description of what you did and why you did it,
  3. with clear practical recommendations that can be applied directly to the business.

If you still want to continue, you’re welcome! If not – you’re still welcome!

A tale about LDA2vec: when LDA meets word2vec

UPD: following the very useful comment by Oren, I see that I really did go too far in describing the differences between word2vec and LDA; in fact, they are not so different from an algorithmic point of view. So I corrected this post. Errare humanum est, stultum est in errore perseverare, you know. Also, I now really recommend that you read this presentation by Yoav Goldberg.

A few days ago I found out that lda2vec (by Chris Moody) had appeared: a hybrid algorithm combining the best ideas from the well-known LDA (Latent Dirichlet Allocation) topic modeling algorithm and from a somewhat less well-known language modeling tool named word2vec.

You can also read this text in Russian, if you like.

And now I’m going to tell you a tale about lda2vec and my attempts to try it out and compare it with a simple LDA implementation (I used the gensim package for this). So, once upon a time…

Top 10 countries on StackOverflow and GitHub

Here I would like to show you the results of an analysis I conducted in late October 2014 to find and outline statistical trends related to users from different countries.

How does the number of users from different countries change over time? Which countries have the most users? Citizens of which countries commit more on GitHub? These are the questions I wanted to answer while working on this analysis. Please note that this analysis is fairly “shallow”: I didn’t analyze the StackOverflow rating system thoroughly, just in a nutshell. Please keep this in mind.

The study was conducted on 24 October 2014. If you are interested, I may update it to bring it up to date, just let me know 🙂

You may also read this post in Russian if you like. The Russian version has more information about the presence of Russian citizens on StackOverflow and GitHub.
