Machine Learning for Classification

Git: https://github.com/elegantwist/catalog_classifier

The task is to build a classifier for the elements of a company dictionary (catalog), based on a text description of each element's scope of interest.

The classification works as follows: based on the description of the element, a set of categories that the record relates to is chosen: from the preset options, a carrier type, a category type, and a service are selected, and consolidation is performed over these categories.

Thus, in the end, we get a catalog of the type Description — Carrier — Category — Service, which can be laid out as a table,

where Description is set from the outside, and Carrier, Category, and Service (if not already filled) are filled in based on the Description.

Until now, the markup (assigning the categories) has been performed manually on a regular basis. As a result, a history has accumulated of which catalog sections were chosen for a given Description.

The goal was set to automate this job: to minimize, or even eliminate, human involvement in the process of marking up the data.

The following describes how this was implemented.

Three approaches were considered for resolving the problem:

1. Using uniform rules specified by a human, where a single fixed Description unambiguously maps to a fixed Carrier-Category-Service set.

2. Using a statistical approach, where the frequency of occurrence of word combinations in the Description is transformed into the frequency of category matches.

3. Using a machine learning approach: training a model on the recorded history to transform a Description into the proper categories, so that the trained model can then choose the desired categories for a given Description.

The third variant was chosen, using the Decision Tree Classifier model from the sklearn package. The decision was made based on an analysis of the data history and the conclusion that the logic of converting a Description into categories is best described by a combination of approaches 1 and 2, i.e. a large number of strictly set rules that describe the required conversion in a statistically optimal way, which fits the logic of the Decision Tree Classifier perfectly.

Task implementation was divided into four stages:

1. Loading the data for processing,

2. Preparing the data for classification,

3. Model training and obtaining the classification result, and

4. Validation of the result.

The standard ML-bundle is used in the implementation: pandas + numpy + sklearn.

1. Loading the data for processing

Perform the following steps:

– load the data from a CSV to pandas tables,

– select the data to be the basis for training the model (base), and

– select the data to be used for applying the trained model (the target data).

For the purposes of this article, the target data is taken from the existing base dataset, so that the known values can be compared with the classified ones and the obtained model can be properly validated.

Loading the data is trivial, using read_csv. The result is a pandas table to be used for further work.
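A minimal sketch of this step (the file name is an assumption):

import pandas as pd
import numpy as np

# load the raw catalog into a pandas table
catalog_dt = pd.read_csv("catalog.csv")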

After loading, divide the data into the training and the target sets. For validation purposes, 5% of the total corpus is enough, so we only need to shuffle the source data and set aside any 5%, leaving the remaining 95% for training. This is a typical split, so we will stick to it.

Here is the code that does it:

# randomly mark ~95% of the rows for training; the remaining ~5% goes to validation
mask_val = np.random.rand(len(catalog_dt)) < 0.95
train_dt = catalog_dt.loc[mask_val].reset_index(drop=True)
validation_dt = catalog_dt.loc[~mask_val].reset_index(drop=True)

2. Preparing the data for classification

Once the needed data slices are obtained, they are to be prepared for subsequent use.

The problem is that the Description column contains data with unfixed content in an unfixed position; the content may be in Russian; the description text can contain "garbage" such as extra spaces, comments in brackets, punctuation marks, or stray letters; and website names may be specified both with and without a domain (for example, mail and mail.ru are used with the same meaning).

For a human, these nuances are not a problem. However, in order to correctly build an ML model capable of working "in a stream", a unified data corpus is required. Therefore, before training the model and using it to mark up new data, the Description should be made uniform.

To do this:

a) Create data purging rules and apply them to the data,

b) Formulate rules for transforming the purged data into a form suitable for training, and apply them to the data,

c) Convert the text data to the numeric format to be used in the DTC, and

d) Convert the result of DTC work back to the text description of classes.

Implementation:

a) Based on the results of the data analysis, the following purging rule was formed for an incoming description string that may appear in the Description field (translit, obviously, converts Russian letters into Latin ones):

import re

def clean_rule(str_value=""):

  vl = str_value
  # drop the text inside brackets, then the brackets themselves
  vl = re.sub(r'(?<=\().+?(?=\))', '', vl)
  vl = vl.replace("(", "")
  vl = vl.replace(")", "")
  vl = vl.strip()
  # drop common domain suffixes so that, e.g., MAIL.RU and MAIL become the same token
  vl = vl.replace(".COM", "")
  vl = vl.replace(".RU", "")
  vl = vl.replace(".INFO", "")
  vl = vl.replace(".ORG", "")
  vl = vl.replace(".NET", "")
  # transliterate Russian letters into Latin ones and normalize the case
  vl = translit.translify(vl)
  vl = vl.upper()
  return vl

Now this rule can be used wherever the raw data needs to be brought into the "unified" form.
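For illustration, a hypothetical call with a made-up raw value:

# brackets with their contents, extra spaces, and the domain suffix are stripped
clean_rule(" MAIL.RU (corporate mail) ")   # -> "MAIL"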

b) After that, prepare the processed table for training using the DTC

The complication is that the Description field contains quanta of information separated by ";", and it is not known beforehand whether they come in a fixed order, or whether the quanta contain errors and "garbage".

Thus, for each row in the table, do the following:

– Split the string into its elements, or quanta,

– For each element, use the cleanup rule,

– Constraints do not allow assessing the meaning of each quantum, so the problem of the unfixed order of the quanta has to be reduced somehow. To do this, it is enough to arrange the quanta in ascending order of element length, and

– For each line of the Description field, take the first 4 elements (whether filled or empty) and use them as columns of the table used for training (the Article-N columns).

The splitting, cleanup, and sorting are implemented in just three lines inside the iteration over the rows of the source table:

a_l_list_raw = article_list_str.split(";")                  # split the Description into quanta
a_l_list_cl = [clean_rule(item) for item in a_l_list_raw]   # purge each quantum
a_l_list = sorted(a_l_list_cl, key=len)                     # arrange by element length, ascending
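The last step, filling the Article-N columns, is not part of these three lines; here is a minimal sketch of how the full per-row loop could look (the Description column name and the padding with empty strings are assumptions):

for row_index, row in train_dt.iterrows():
  # split, purge, and sort the quanta of the current Description
  a_l_list_raw = str(row["Description"]).split(";")
  a_l_list_cl = [clean_rule(item) for item in a_l_list_raw]
  a_l_list = sorted(a_l_list_cl, key=len)
  # keep exactly four quanta, padding with empty strings when there are fewer
  a_l_list = (a_l_list + [""] * 4)[:4]
  for n, value in enumerate(a_l_list, start=1):
    train_dt.loc[row_index, "Article" + str(n)] = value
  # the same loop is applied to the target (predict_dt) table
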

As a result of the processing, we get a table with the Article-N columns filled in.

c) After bringing the table to the "common" form, it is necessary to encode the possible values of the Article columns (the information quanta of the Description field) into an internal numeric classifier suitable for training and for use in the tree model. That is, convert the article values to a binary code, the so-called one-hot encoding.

For doing so, you will need to:

– concatenate the training and target tables. This is necessary because some values of the Article columns may be contained in only one of the tables, and we have to use the entire catalog (all possible article values). With one-hot encoding, the value variants are transformed into columns, so we have to work with a number of columns that is known in advance,

– perform binary encoding of the values in the catalog using the built-in pandas get_dummies function, and

– the result of the transformation is to be re-divided into the training and target data tables.

Implementation:

var_columns = ['Article1', 'Article2', 'Article3', 'Article4']

# concatenate the two tables so that get_dummies sees all possible article values
res_dt = pd.concat([train_dt[var_columns], predict_dt[var_columns]]).reset_index(drop=True)
res_dt_d = pd.get_dummies(res_dt)

# and split them again
len_of_predict_dt = len(predict_dt[var_columns])
X_train, X_test = res_dt_d[:-len_of_predict_dt], res_dt_d[-len_of_predict_dt:]

X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)

y_train_ts = pd.get_dummies(train_dt['TypeService'])

The obtained one-hot encoded table is to be passed for model training, and for target data classification.

At the stage of using the model, a problem arises: the classification result is also produced as a one-hot encoded binary data set, which cannot be used directly as the final result. It is therefore necessary to translate the output of the model (and, consequently, of the get_dummies function) back into a human-friendly catalog.

For doing so, use the following function:

def retranslate_dummies(pd_dummies_Y, y_pred):

  # map column positions back to the original category names
  c_n = {}
  for p_i in range(len(pd_dummies_Y.columns)):
    c_n.update({
      p_i: pd_dummies_Y.columns[p_i],
    })
  d_df = pd.DataFrame(y_pred)

  y_pred_df = d_df.rename(columns=c_n)

  # for every predicted row, pick the categories whose columns are set to 1
  res = []
  for i, row in y_pred_df.iterrows():

    n_val = [clmn for clmn in y_pred_df.columns if row[clmn] == 1]

    # no column set: the model did not choose any category for this row
    if n_val == "" or len(n_val) == 0:
      res = res + [np.nan]
    else:
      res = res + n_val

  return res

It takes as input the complete encoded catalog and the values produced by the classifier model, and returns them translated back into the original human-friendly format, which is the purpose of the system.

3. Model training and classification

For the process of training and categorization, a standard implementation of Decision Tree Classifier from the sklearn package is used.

For training, the classifier receives the data encoded in the previous step together with the expected classification result.

After training, the model receives the encoded target data and outputs a one-hot encoded classification result, which is then converted back to the human-friendly form.

We use this approach separately for each section: first for locating the desired Service, then for the Category, and finally for the Carrier.
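A minimal sketch of this step for the Service section, assuming the X_train, X_test, and y_train_ts tables prepared above; the default classifier parameters and the TypeService_pred column used to store the predictions are assumptions:

from sklearn.tree import DecisionTreeClassifier

# train the tree on the encoded descriptions and the one-hot encoded Service labels
model_ts = DecisionTreeClassifier()
model_ts.fit(X_train, y_train_ts)

# classify the target data and translate the one-hot output back to category names
y_pred_ts = model_ts.predict(X_test)
predict_dt['TypeService_pred'] = retranslate_dummies(y_train_ts, y_pred_ts)

The same three steps are repeated for the Category and Carrier sections with their own label tables.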

4. Validation of the result

Check that the result meets expectations. Since we kept a known classification result (the 5% slice of the original dataset), we can simply compare the real, "correct" categories chosen by a user with those chosen automatically during classification.
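A simple way to sketch this comparison, assuming the predictions were stored next to the known values in columns with a _pred suffix (the Carrier and Category column names are assumptions):

# a row counts as definitely correct only if all three categories match the history
cat_columns = ['Carrier', 'Category', 'TypeService']
correct_mask = pd.Series(True, index=predict_dt.index)
for clmn in cat_columns:
  correct_mask = correct_mask & (predict_dt[clmn] == predict_dt[clmn + '_pred'])

print("definitely correct: {:.0%}".format(correct_mask.mean()))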

By the conditions of the task, if at least one of the categories is filled incorrectly (the automatically filled value differs from the one existing in the catalog), we consider the filling "conditionally incorrect". "Conditionally incorrect" fields may contain the following values:

– values definitely identified incorrectly by the classifier,

– values chosen from the catalog for a Description that has not been met before,

– values chosen from the catalog for a Description that has been met before, but fewer times than is adequate for unambiguous classification,

– cases where the conversion variant chosen for classification occurs somewhat more rarely than the alternatives, yet is present in the history of conversions to the desired class, and

– cases where the chosen classification variant has been met in less than a determined share of similar cases.

The formalized goal of this work is to reduce the percentage of such conditionally incorrectly filled categories.

The obtained results

As a result of implementing the solution, 60% of the categories were filled definitely correctly, 5% were filled definitely incorrectly, and 35% were "conditionally incorrect" choices.

Further analysis showed that the 35% of "conditionally incorrect" classifications do not affect the overall result of the system: these were either new values whose classification was successful but could not be verified due to the lack of translation history, or records that were occasionally filled incorrectly in the history. This allows human involvement in the classification of the catalog to be removed with a sufficient degree of confidence, turning the classification into a fully automated process.