Yelp Data Analysis

  • Date: Oct 2017
  • Category: Data Science
  • Key Tags: Big Data, Hadoop, Spark

A project that uses the Yelp Dataset for basic big data analysis.

Introduction

Extract around 490,000 records related to restaurants and users from the Yelp Dataset. Use Hadoop MapReduce to derive statistics from the dataset, such as the top 10 restaurants by average rating in a specific area. Re-implement the analysis in Spark, run via a shell script on the same dataset, to validate the results and compare the pros and cons of the two techniques.

Q: List the business_id, full address, and categories of the top 10 businesses by average rating.
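
The logic behind this question can be sketched in plain Python as a map step (emit one rating per review), a reduce step (average per business), and a join with the business details. This is only an illustrative sketch, not the project's Hadoop or Spark code: the "::" delimiter, the column order, and the sample rows are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sample rows. The "::" delimiter and column order are
# assumptions for illustration, not confirmed by the project files.
BUSINESS_ROWS = [
    "b1::123 Main St, Palo Alto, CA::List(Restaurants, Pizza)",
    "b2::45 Oak Ave, Stanford, CA::List(Restaurants, Thai)",
    "b3::9 Elm Rd, Palo Alto, CA::List(Cafes)",
]
REVIEW_ROWS = [
    "r1::u1::b1::4.0",
    "r2::u2::b1::5.0",
    "r3::u1::b2::3.0",
    "r4::u3::b3::5.0",
]


def top_businesses(business_rows, review_rows, n=10):
    """Return up to n (business_id, address, categories, avg_rating) tuples."""
    # "Map" step: accumulate (sum of stars, review count) per business_id.
    totals = defaultdict(lambda: [0.0, 0])
    for row in review_rows:
        _review_id, _user_id, business_id, stars = row.split("::")
        totals[business_id][0] += float(stars)
        totals[business_id][1] += 1

    # "Reduce" step: compute the average rating per business.
    averages = {b: s / c for b, (s, c) in totals.items()}

    # Lookup table of business details for the join.
    details = {}
    for row in business_rows:
        business_id, address, categories = row.split("::")
        details[business_id] = (address, categories)

    # Keep only businesses with known details, then sort by average
    # rating (descending), breaking ties by business_id.
    ranked = sorted(
        ((b, avg) for b, avg in averages.items() if b in details),
        key=lambda pair: (-pair[1], pair[0]),
    )
    return [(b, details[b][0], details[b][1], avg) for b, avg in ranked[:n]]


if __name__ == "__main__":
    for record in top_businesses(BUSINESS_ROWS, REVIEW_ROWS):
        print(record)
```

Sorting on the full set of averages is fine for a small sample; on the real dataset a bounded top-N structure per reducer would be the more usual choice.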

Q: List the 'user id' and 'rating' of users who reviewed businesses located in “Palo Alto”.
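
The Palo Alto question amounts to a filter followed by a join: first collect the IDs of businesses whose address mentions the city, then keep the reviews that point at those IDs. The sketch below shows this in plain Python, similar in spirit to a map-side join. It is not the project's actual code; the "::" delimiter, the column order, and the sample rows are assumptions.

```python
# Hypothetical sample rows. The "::" delimiter and column order are
# assumptions for illustration, not confirmed by the project files.
BUSINESS_ROWS = [
    "b1::123 Main St, Palo Alto, CA::List(Restaurants, Pizza)",
    "b2::45 Oak Ave, Stanford, CA::List(Restaurants, Thai)",
    "b3::9 Elm Rd, Palo Alto, CA::List(Cafes)",
]
REVIEW_ROWS = [
    "r1::u1::b1::4.0",
    "r2::u2::b1::5.0",
    "r3::u1::b2::3.0",
    "r4::u3::b3::5.0",
]


def ratings_in_city(business_rows, review_rows, city="Palo Alto"):
    """Return (user_id, rating) pairs for reviews of businesses in city."""
    # Filter: collect business_ids whose address contains the city name.
    local_ids = set()
    for row in business_rows:
        business_id, address, _categories = row.split("::")
        if city in address:
            local_ids.add(business_id)

    # Join: keep only reviews whose business_id is in the filtered set.
    pairs = []
    for row in review_rows:
        _review_id, user_id, business_id, stars = row.split("::")
        if business_id in local_ids:
            pairs.append((user_id, float(stars)))
    return pairs


if __name__ == "__main__":
    for user_id, rating in ratings_in_city(BUSINESS_ROWS, REVIEW_ROWS):
        print(user_id, rating)
```

Holding the filtered business IDs in memory works when that set is small relative to the review data, which is the usual condition for choosing a map-side join over a reduce-side one.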

Setup

Hadoop

Start HDFS by running:

start-dfs.sh

Upload the input data files to HDFS:

hdfs dfs -put <business.csv> <hdfs_input_dir>
e.g.: hdfs dfs -put ~/Documents/input_files/business.csv /parallels/input
hdfs dfs -put <review.csv> <hdfs_input_dir>
e.g.: hdfs dfs -put ~/Documents/input_files/review.csv /parallels/input
hdfs dfs -put <soc-LiveJournal1Adj.txt> <hdfs_input_dir>
e.g.: hdfs dfs -put ~/Documents/input_files/soc-LiveJournal1Adj.txt /parallels/input
hdfs dfs -put <user.csv> <hdfs_input_dir>
e.g.: hdfs dfs -put ~/Documents/input_files/user.csv /parallels/input

Spark

Open the source code file in PyCharm together with the data files, then click Run. All results are written to the output file.

Source