End To End Scalable Machine Learning Project On Google Cloud With Beautiful Front End with Big Data-I

In the last post you explored the dataset; in this post our key focus will be building a sample model to test our machine learning knowledge, and in the final post we will build it at scale and deploy it. If you have not read my previous post (this will hurt me), you can read it here before continuing with this one, or recap it if you have forgotten the dataset and the aim of our project (not of life).

First we need to create the dataset for our sample model because, if you remember, we have millions of rows that we can't fit into our 15-inch banged-up MacBook Pro with 4 GB of RAM. So we need to sample the data very carefully, so that all kinds of examples are included. Unlike our paid, biased media survey reports, where Donald Trump is making America great again or Narendra Modi is the supreme leader of India, our sampling needs to be bias-free to make good features.

Things you need to follow to have good features in a dataset:

So, following all five points, one important note is point no. 2: use only what you know at the time of prediction. In our case, can you guess what that is? Write it down in the comment section without looking at the next slide, and come back later to check.

Now comes making the data to test our model with. But how can we get a dataset that is bias-free, unlike our society's paid media groups? We are literate enough to know that bias-free sampling means random sampling. But there is a flaw in doing a plain random sample in our case; can you identify it? Let's discuss. Suppose you have triplets in the dataset, which means the same entry appears three times. If some of those three rows end up in your training data and the rest in your validation or test data (I assume you are familiar with train, test and validation datasets), then by common sense you can tell the measured performance of your model will be misleading; no machine learning is required to see that. So how do we solve this? Here comes the FARM_FINGERPRINT function: it takes a string as input and converts it into a numeric hash.
The output of this function for a particular input will never change.

Examples

WITH example AS (
  SELECT 1 AS x, "foo" AS y, true AS z UNION ALL
  SELECT 2 AS x, "apple" AS y, false AS z UNION ALL
  SELECT 3 AS x, "" AS y, true AS z
)
SELECT
  *,
  FARM_FINGERPRINT(CONCAT(CAST(x AS STRING), y, CAST(z AS STRING)))
    AS row_fingerprint
FROM example;
+---+-------+-------+----------------------+
| x | y     | z     | row_fingerprint      |
+---+-------+-------+----------------------+
| 1 | foo   | true  | -1541654101129638711 |
| 2 | apple | false | 2794438866806483259  |
| 3 |       | true  | -4880158226897771312 |
+---+-------+-------+----------------------+

If you want to learn more, visit here.

You will find some highly enthusiastic people, or some who just want to show their unprecedented presence in your organisation, asking why we are not using the whole dataset for the model and are sampling instead. Show them this:

Now here comes the question: how exactly do you sample the data? Well, you sample it simply with an SQL command like this:

Remember, this is a different example, with the intention of showing that FARM_FINGERPRINT takes care of our earlier problem and that the RAND() function thins the data down to the amount we require. Suppose we have 10,000,000 records; then the following code produces about 100,000 records.
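To make that concrete, here is a sketch of such a hash-plus-RAND() sampling query. The table and column names (my_dataset.my_table, date) are placeholders, not our natality data; the point is the pattern: FARM_FINGERPRINT keeps every row of the same group on the same side of the split, and RAND() then thins the kept rows down to the volume we want (0.8 × 0.0125 ≈ 1% of 10,000,000 rows ≈ 100,000 rows).

sample_query = """
SELECT
  *
FROM
  `my_dataset.my_table`  -- placeholder table name
WHERE
  MOD(ABS(FARM_FINGERPRINT(CAST(date AS STRING))), 10) < 8  -- repeatable 80% of the groups
  AND RAND() < 0.0125                                       -- then keep ~1.25% of those rows
"""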

Once you are in Datalab (a Jupyter environment for running scripts, offered by Google), first set your bucket, region and project id again; these will be the same as in the previous post (a sketch of that setup cell follows). Then we will write a query to sample the data.
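Here is a minimal sketch of that setup cell; the values below are placeholders, so put in the same project id, bucket and region you used in the previous post:

import os

# Placeholder values -- replace with the ones from your previous post
PROJECT = 'my-gcp-project'     # your GCP project id
BUCKET = 'my-gcs-bucket'       # the Cloud Storage bucket you created earlier
REGION = 'us-central1'         # the region you picked earlier

os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION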

import google.datalab.bigquery as bq
query = """
SELECT
  weight_pounds,
  is_male,
  mother_age,
  plurality,
  gestation_weeks,
  ABS(FARM_FINGERPRINT(CONCAT(CAST(year AS STRING), CAST(month AS STRING)))) AS hashmonth
FROM
  publicdata.samples.natality
WHERE year > 2000
"""


But think: we are going to select all the data from 2000 onwards, and that will give us approximately 33,240,602 rows, which is far more than we want right now.

So here come the hash and RAND(). If we filter on the hash to keep, say, 70% of the hash groups, we get a repeatable 70% of the data; if we then randomly keep 1% of those rows, we are left with a well-distributed 0.7% of the data. Got it? We get exactly what we need.

In our case we have 96 unique hashmonths; here is what the sample looks like:

   hashmonth            num_babies
0  6392072535155213407  323758
1  8387817883864991792  331629
2  328012383083104805   359891
3  9183605629983195042  329975
4  8391424625589759186  364497
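A table like the one above can be produced with a quick aggregation over the sampling query; here is a sketch that reuses the bq module and the query string defined above:

counts_query = "SELECT hashmonth, COUNT(weight_pounds) AS num_babies FROM (" + query + ") GROUP BY hashmonth"
df = bq.Query(counts_query).execute().result().to_dataframe()
print('There are {} unique hashmonths.'.format(len(df)))
df.head()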

This chunk of code will produce our train and evaluation datasets, split on the hash for the reason stated above:

trainQuery = "SELECT * FROM (" + query + ") WHERE MOD(hashmonth, 4) < 3 AND RAND() < 0.0005"
evalQuery = "SELECT * FROM (" + query + ") WHERE MOD(hashmonth, 4) = 3 AND RAND() < 0.0005"
traindf = bq.Query(trainQuery).execute().result().to_dataframe()
evaldf = bq.Query(evalQuery).execute().result().to_dataframe()

A simple tip to check whether all rows have values: hit traindf.describe(). If you get results like this (notice the counts differ across columns, which means some rows have NULL values),

       weight_pounds  mother_age    plurality     gestation_weeks  hashmonth
count  13368.000000   13375.000000  13375.000000  13265.000000     1.337500e+04
mean   7.215937       27.380710     1.036037      38.597361        4.425355e+18
std    1.338819       6.188586      0.196163      2.600955         2.795168e+18
min    0.500449       12.000000     1.000000      17.000000        1.244589e+17
25%    6.563162       22.000000     1.000000      38.000000        1.622638e+18
50%    7.312733       27.000000     1.000000      39.000000        4.329667e+18
75%    8.054038       32.000000     1.000000      40.000000        7.170970e+18
max    18.000744      50.000000     4.000000      47.000000        9.183606e+18

then change the query to eliminate the NULL values at the source:

trainQuery = ("SELECT * FROM (" + query + ")"
              " WHERE MOD(hashmonth, 4) < 3 AND RAND() < 0.0005"
              " AND weight_pounds IS NOT NULL"
              " AND gestation_weeks IS NOT NULL")

Now, to clean the data, you can apply your business logic, for example:

df = df[df.weight_pounds > 0]
df = df[df.mother_age > 0]
df = df[df.gestation_weeks > 0]
df = df[df.plurality > 0]

The plurality column also needs to be converted into a meaningful column, so you can map its values via a dictionary:

dict(zip([1,2,3,4,5],
                   ['Single(1)', 'Twins(2)', 'Triplets(3)', 'Quadruplets(4)', 'Quintuplets(5)']))

or
{1:"Single(1)",2:"Twins(2)"} in this way which suits you 

Put all of this into a function and return the cleaned dataframe, like this:

import pandas as pd

def preprocess(df):
  # clean up data we don't want to train on
  # in other words, users will have to tell us the mother's age
  # otherwise, our ML service won't work.
  # these were chosen because they are such good predictors
  # and because these are easy enough to collect
  df = df[df.weight_pounds > 0]
  df = df[df.mother_age > 0]
  df = df[df.gestation_weeks > 0]
  df = df[df.plurality > 0]
  
  # modify plurality field to be a string
  twins_etc = dict(zip([1,2,3,4,5],
                   ['Single(1)', 'Twins(2)', 'Triplets(3)', 'Quadruplets(4)', 'Quintuplets(5)']))
  df['plurality'].replace(twins_etc, inplace=True)
  
  # now create extra rows to simulate lack of ultrasound
  nous = df.copy(deep=True)
  nous.loc[nous['plurality'] != 'Single(1)', 'plurality'] = 'Multiple(2+)'
  nous['is_male'] = 'Unknown'
  
  return pd.concat([df, nous])
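
Something like this applies the cleanup to both of the splits we created earlier:

traindf = preprocess(traindf)
evaldf = preprocess(evaldf)
traindf.head()   # quick sanity check of the cleaned data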


In the final versions, we want to read from files, not Pandas dataframes. So, write the Pandas dataframes out as CSV files. Using CSV files gives us the advantage of shuffling during read. This is important for distributed training because some workers might be slower than others, and shuffling the data helps prevent the same data from being assigned to the slow workers.

traindf.to_csv('train.csv', index=False, header=False)
evaldf.to_csv('eval.csv', index=False, header=False)

Check your CSV files:

%bash
wc -l *.csv
head *.csv
tail *.csv

Now comes the model-building part: which model should I use? You can even test your model locally; remember, we made our datasets the way Kaggle provides them, so go to Google Colab, upload them and try all possible models to find out which works best. There is a view that a wide-and-deep model works best on structured data, for the reasons explained below.

Structured data consists of two types of features: dense (continuous) and sparse (categorical/discrete). In the example below, "employeeId" is a sparse feature while "price" is a dense one.
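As a quick illustration of how the two kinds map onto TensorFlow feature columns (the column names are just for this example, not our natality features):

import tensorflow as tf

# Dense (continuous) feature -> numeric column
price = tf.feature_column.numeric_column('price')

# Sparse (categorical) feature with many possible ids -> hashed categorical column
employee_id = tf.feature_column.categorical_column_with_hash_bucket(
    'employeeId', hash_bucket_size=1000)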

Why not just a DNN? Because with lots of sparse features, the dense embeddings a DNN learns tend to over-generalize.

And if you have lots of sparse data, a linear model with crossed features memorizes those combinations well but cannot generalize to combinations it has never seen.

So why a wide-and-deep model? Because it gives you both: the wide (linear) part handles memorization of the sparse features, while the deep part handles generalization over the dense ones.

So what is the syntax for writing this?
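Here is a minimal sketch of a wide-and-deep estimator in the Estimator API; the feature columns, output directory and hidden-unit sizes below are illustrative assumptions, not the final model:

import tensorflow as tf

# Sparse columns feed the wide (linear) part, dense columns feed the deep (DNN) part
wide = [tf.feature_column.categorical_column_with_vocabulary_list(
            'is_male', ['True', 'False', 'Unknown'])]
deep = [tf.feature_column.numeric_column('mother_age'),
        tf.feature_column.numeric_column('gestation_weeks')]

model = tf.estimator.DNNLinearCombinedRegressor(
    model_dir='wd_trained',            # illustrative output directory
    linear_feature_columns=wide,
    dnn_feature_columns=deep,
    dnn_hidden_units=[64, 32])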

I would suggest that if, like me, you did not study computer science (I did my bachelor's in civil engineering), you don't blindly follow any GitHub repo. Open the corresponding documentation page (here, TensorFlow) and understand the functions in detail before applying them; that way, even if you make a typo, you can spot your mistake from the error instead of having to post on Stack Overflow.

I am showing the DNN model here; you should try the linear and the wide-and-deep ones locally on your machine.

import shutil
import numpy as np
import tensorflow as tf
print(tf.__version__)

Import the required modules. Python's shutil module lets us work with files easily without diving deep into file objects; TensorFlow is the ML framework, and NumPy is for numerical processing.
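The input function below refers to CSV_COLUMNS, LABEL_COLUMN and DEFAULTS, which are not shown in this post, so define them first. A minimal sketch, assuming train.csv and eval.csv contain the six columns in the order of our sampling query (check the actual order with the head command used above):

CSV_COLUMNS = ['weight_pounds', 'is_male', 'mother_age',
               'plurality', 'gestation_weeks', 'hashmonth']
LABEL_COLUMN = 'weight_pounds'
# One default per column; the type of each default tells decode_csv how to parse that column
DEFAULTS = [[0.0], ['null'], [0.0], ['null'], [0.0], ['nokey']]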

Then create an input function that reads a file using the Dataset API:

def read_dataset(filename, mode, batch_size = 512):
  def _input_fn():
    def decode_csv(value_column):
      columns = tf.decode_csv(value_column, record_defaults=DEFAULTS)
      features = dict(zip(CSV_COLUMNS, columns))
      label = features.pop(LABEL_COLUMN)
      return features, label

Then create a list of files that match the pattern:

file_list = tf.gfile.Glob(filename)

Then create the dataset and provide the results to the Estimator API:

# Create dataset from file list
    dataset = (tf.data.TextLineDataset(file_list)  # Read text file
                 .map(decode_csv))  # Transform each elem by applying decode_csv fn
      
    if mode == tf.estimator.ModeKeys.TRAIN:
        num_epochs = None # indefinitely
        dataset = dataset.shuffle(buffer_size=10*batch_size)
    else:
        num_epochs = 1 # end-of-input after this
 
    dataset = dataset.repeat(num_epochs).batch(batch_size)
    return dataset.make_one_shot_iterator().get_next()
  return _input_fn

Putting it all together:

# Create an input function reading a file using the Dataset API
# Then provide the results to the Estimator API
def read_dataset(filename, mode, batch_size = 512):
  def _input_fn():
    def decode_csv(value_column):
      columns = tf.decode_csv(value_column, record_defaults=DEFAULTS)
      features = dict(zip(CSV_COLUMNS, columns))
      label = features.pop(LABEL_COLUMN)
      return features, label
    
    # Create list of files that match pattern
    file_list = tf.gfile.Glob(filename)

    # Create dataset from file list
    dataset = (tf.data.TextLineDataset(file_list)  # Read text file
                 .map(decode_csv))  # Transform each elem by applying decode_csv fn
      
    if mode == tf.estimator.ModeKeys.TRAIN:
        num_epochs = None # indefinitely
        dataset = dataset.shuffle(buffer_size=10*batch_size)
    else:
        num_epochs = 1 # end-of-input after this
 
    dataset = dataset.repeat(num_epochs).batch(batch_size)
    return dataset.make_one_shot_iterator().get_next()
  return _input_fn

Next, define the feature columns

# Define feature columns
def get_categorical(name, values):
  return tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_vocabulary_list(name, values))

def get_cols():
  # Define column types
  return [\
          get_categorical('is_male', ['True', 'False', 'Unknown']),
          tf.feature_column.numeric_column('mother_age'),
          get_categorical('plurality',
                      ['Single(1)', 'Twins(2)', 'Triplets(3)',
                       'Quadruplets(4)', 'Quintuplets(5)','Multiple(2+)']),
          tf.feature_column.numeric_column('gestation_weeks')
      ]

To predict with the TensorFlow model, we also need a serving input function. We will want all the inputs from our user.

# Create serving input function to be able to serve predictions later using provided inputs
def serving_input_fn():
    feature_placeholders = {
        'is_male': tf.placeholder(tf.string, [None]),
        'mother_age': tf.placeholder(tf.float32, [None]),
        'plurality': tf.placeholder(tf.string, [None]),
        'gestation_weeks': tf.placeholder(tf.float32, [None])
    }
    features = {
        key: tf.expand_dims(tensor, -1)
        for key, tensor in feature_placeholders.items()
    }
    return tf.estimator.export.ServingInputReceiver(features, feature_placeholders)


 
# Create estimator to train and evaluate
def train_and_evaluate(output_dir):
  EVAL_INTERVAL = 300
  run_config = tf.estimator.RunConfig(save_checkpoints_secs = EVAL_INTERVAL,
                                      keep_checkpoint_max = 3)
  estimator = tf.estimator.DNNRegressor(
                       model_dir = output_dir,
                       feature_columns = get_cols(),
                       hidden_units = [64, 32],
                       config = run_config)
  train_spec = tf.estimator.TrainSpec(
                       input_fn = read_dataset('train.csv', mode = tf.estimator.ModeKeys.TRAIN),
                       max_steps = TRAIN_STEPS)
  exporter = tf.estimator.LatestExporter('exporter', serving_input_fn)
  eval_spec = tf.estimator.EvalSpec(
                       input_fn = read_dataset('eval.csv', mode = tf.estimator.ModeKeys.EVAL),
                       steps = None,
                       start_delay_secs = 60, # start evaluating after N seconds
                       throttle_secs = EVAL_INTERVAL,  # evaluate every N seconds
                       exporters = exporter)
  tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
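One thing to note: TRAIN_STEPS is used above but never defined in this post, so set it before calling train_and_evaluate. The exact value is a tuning choice; something small like this is enough for our tiny sampled dataset:

TRAIN_STEPS = 1000   # assumption: increase this when training on the full dataset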



 
# Run the model
shutil.rmtree('babyweight_trained', ignore_errors = True) # start fresh each time
train_and_evaluate('babyweight_trained')

Monitor and experiment with training

from google.datalab.ml import TensorBoard
TensorBoard().start('./babyweight_trained')

In TensorBoard, look at the learned embeddings. Are they getting clustered? How about the weights for the hidden layers? What if you run this longer? What happens if you change the batch size?

for pid in TensorBoard.list()['pid']:
  TensorBoard().stop(pid)
  print('Stopped TensorBoard with pid {}'.format(pid))

We will see results like this on TensorBoard:

Congratulations, you have made it: you just built the model yourself. Now try to build it with the wide-and-deep model. In the next blog post we will deploy our model and serve it with a front end; till then, check out my Medium page for more posts. If you have any questions, you can comment below.

If you like the post and want your colleagues or friends to learn the same, hit the like button and share it on LinkedIn, Facebook and Twitter. Let's grow together for a bias-free machine-human compiled future.
