В этом проекте:
1) Написали оптимальный запрос для нашей аналитической модели на сервисе BigQuery.
2) Обучение нашей модели машинного обучения.
Прежде всего, я беру общедоступный набор данных под названием «Поездки в желтом такси Нью-Йорка».
Сначала мы начнем с изучения нашего набора данных.
#standardSQL SELECT TIMESTAMP_TRUNC(pickup_datetime, MONTH) month, COUNT(*) trips FROM `bigquery-public-data.new_york.tlc_yellow_trips_2015` GROUP BY 1 ORDER BY 1
Этот запрос выдаст вам вывод со столбцом в виде месяца и количеством поездок, совершенных в этом месяце.
Теперь мы выберем функции (столбцы) для обучения нашей модели.
- Сумма пошлины
- Сумма тарифа
- Час дня
- Выберите адрес
- Адрес доставки
- Количество пассажиров
#standardSQL WITH params AS ( SELECT 1 AS TRAIN, 2 AS EVAL ), daynames AS (SELECT ['Sun', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat'] AS daysofweek), taxitrips AS ( SELECT (tolls_amount + fare_amount) AS total_fare, daysofweek[ORDINAL(EXTRACT(DAYOFWEEK FROM pickup_datetime))] AS dayofweek, EXTRACT(HOUR FROM pickup_datetime) AS hourofday, pickup_longitude AS pickuplon, pickup_latitude AS pickuplat, dropoff_longitude AS dropofflon, dropoff_latitude AS dropofflat, passenger_count AS passengers FROM `nyc-tlc.yellow.trips`, daynames, params WHERE trip_distance > 0 AND fare_amount > 0 AND MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))),1000) = params.TRAIN ) SELECT * FROM taxitrips
Для создания модели BigQuery ML нам нужно сначала создать набор данных в нашем проекте gcp.
— Я называю свой идентификатор набора данных для такси.
Теперь давайте создадим и обучим наше машинное обучение.
Эта модель должна быть линейной регрессии, потому что мы должны предсказать определенное значение.
CREATE or REPLACE MODEL taxi.taxifare_model OPTIONS (model_type='linear_reg', labels=['total_fare']) AS WITH params AS ( SELECT 1 AS TRAIN, 2 AS EVAL ), daynames AS (SELECT ['Sun', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat'] AS daysofweek), taxitrips AS ( SELECT (tolls_amount + fare_amount) AS total_fare, daysofweek[ORDINAL(EXTRACT(DAYOFWEEK FROM pickup_datetime))] AS dayofweek, EXTRACT(HOUR FROM pickup_datetime) AS hourofday, pickup_longitude AS pickuplon, pickup_latitude AS pickuplat, dropoff_longitude AS dropofflon, dropoff_latitude AS dropofflat, passenger_count AS passengers FROM `nyc-tlc.yellow.trips`, daynames, params WHERE trip_distance > 0 AND fare_amount > 0 AND MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))),1000) = params.TRAIN ) SELECT * FROM taxitrips
Теперь мы оценим нашу модель ML.
#standardSQL SELECT SQRT(mean_squared_error) AS rmse FROM ML.EVALUATE(MODEL taxi.taxifare_model, ( WITH params AS ( SELECT 1 AS TRAIN, 2 AS EVAL ), daynames AS (SELECT ['Sun', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat'] AS daysofweek), taxitrips AS ( SELECT (tolls_amount + fare_amount) AS total_fare, daysofweek[ORDINAL(EXTRACT(DAYOFWEEK FROM pickup_datetime))] AS dayofweek, EXTRACT(HOUR FROM pickup_datetime) AS hourofday, pickup_longitude AS pickuplon, pickup_latitude AS pickuplat, dropoff_longitude AS dropofflon, dropoff_latitude AS dropofflat, passenger_count AS passengers FROM `nyc-tlc.yellow.trips`, daynames, params WHERE trip_distance > 0 AND fare_amount > 0 AND MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))),1000) = params.EVAL ) SELECT * FROM taxitrips ))
Оценив нашу модель, вы получите RMSE в размере 9,47 доллара…
Наконец, мы собираемся прогнозировать стоимость поездки на такси с помощью этой модели анализа данных.
#standardSQL SELECT * FROM ml.PREDICT(MODEL `taxi.taxifare_model`, ( WITH params AS ( SELECT 1 AS TRAIN, 2 AS EVAL ), daynames AS (SELECT ['Sun', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat'] AS daysofweek), taxitrips AS ( SELECT (tolls_amount + fare_amount) AS total_fare, daysofweek[ORDINAL(EXTRACT(DAYOFWEEK FROM pickup_datetime))] AS dayofweek, EXTRACT(HOUR FROM pickup_datetime) AS hourofday, pickup_longitude AS pickuplon, pickup_latitude AS pickuplat, dropoff_longitude AS dropofflon, dropoff_latitude AS dropofflat, passenger_count AS passengers FROM `nyc-tlc.yellow.trips`, daynames, params WHERE trip_distance > 0 AND fare_amount > 0 AND MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))),1000) = params.EVAL ) SELECT * FROM taxitrips));
Наконец, мы можем легко увидеть в выходной таблице, в которой есть предсказанная стоимость этой конкретной поездки, т.е. predicted_total_fare на основе предоставленных нами параметров с фактическим тарифом, т. е. total_fare, для сравнения прогнозов.
Ура!! Мы успешно создали модель и спрогнозировали стоимость проезда на такси, и все это на Google Cloud Platform!✌