Hive – FitBit JSON example

Here's a quick mockup of working with the FitBit JSON data (which contains nested array elements).

1) Create an HDFS directory:

   hadoop fs -mkdir -p /<hdfs_path>/fitbit

2) Copy your files to the local Unix server (into your home directory, perhaps) using WinSCP, SCP, FTP, or wget, for example.

3) Put your files into HDFS:

   hadoop fs -put /<local_unix_path>/*.json /<hdfs_path>/fitbit

Then from the Hive CLI:

   use <your database, or just use default>;

   drop table if exists fitbit;

   -- The nested struct fields below follow the Fitbit heart-rate API response
   -- (caloriesOut/max/min/minutes/name per zone; time/value per intraday sample).
   create external table fitbit (
      `activities-heart` array<struct<customHeartRateZones:string
                                     ,dateTime:string
                                     ,heartRateZones:array<struct<caloriesOut:string
                                                                 ,max:string
                                                                 ,min:string
                                                                 ,minutes:string
                                                                 ,name:string>>
                                     ,value:string>>
     ,`activities-heart-intraday` struct<dataset:array<struct<time:string,value:string>>
                                        ,datasetInterval:string
                                        ,datasetType:string>
   )
   ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
   STORED AS textfile
   LOCATION '/<hdfs_path>/fitbit'
   TBLPROPERTIES ('serialization.null.format'='');

   select * from fitbit;

I just made up a couple of examples of querying the data:

   ----------------------------------------------------------------------
   -- How many calories did I burn each day?
   ----------------------------------------------------------------------
   select datetime
         ,sum(calories)
   from (select expl.datetime
               ,expl.heartratezones.caloriesout as calories_array
         from fitbit
         lateral view explode(`activities-heart`) tbl as expl) z
   lateral view explode(calories_array) tbl as calories
   group by datetime;

   ----------------------------------------------------------------------
   -- How many times did my heart beat each day?
   ----------------------------------------------------------------------
   select datetime
         ,sum(heartbeats)
   from (select expl.datetime
               ,expl2.value as heartbeats
         from fitbit
         lateral view explode(`activities-heart`) tbl as expl
         lateral view explode(`activities-heart-intraday`.dataset) tbl2 as expl2) z
   group by datetime;
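The openx JsonSerDe isn't bundled with every Hive distribution, so the create table above may fail with a "class not found" style error if the jar isn't on the classpath. Here's a minimal sketch of registering it for the session first; the jar path and version are placeholders, not the exact artifact name you'll have:

   -- Register the openx JSON SerDe for this session (path/version are placeholders)
   ADD JAR /<local_unix_path>/json-serde-1.3-jar-with-dependencies.jar;

   -- Quick sanity checks after creating the table
   select count(*) from fitbit;                              -- one row per JSON document (one per line)
   select `activities-heart`[0].dateTime from fitbit limit 5;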

Common Java Errors

Here's a common error which we've seen several times:

   Failed with exception java.io.IOException:java.lang.ClassCastException:
   org.apache.hadoop.hive.ql.io.orc.OrcStruct cannot be cast to
   org.apache.hadoop.io.BinaryComparable

There's an issue with running a "show create table" on an ORC table and then executing the resulting "create table" statement. The "show create table" output contains this:

   STORED AS INPUTFORMAT
     'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
   OUTPUTFORMAT
     'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'

But if you create the table with those clauses, you will get the casting error above when selecting from it. The generated DDL names the ORC input and output formats but not the ORC SerDe, so Hive falls back to its default text SerDe, which expects Text rows rather than OrcStruct; hence the ClassCastException. To fix this, change your "create table" statement to just this:

   STORED AS ORC
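As a concrete illustration, here's a hedged before/after sketch with a made-up table (my_orc_table and its columns are hypothetical; your column list would come from your own "show create table" output):

   -- What "show create table" hands back (selects will fail with the cast error):
   create table my_orc_table (
      id     int
     ,name   string
   )
   STORED AS INPUTFORMAT
     'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
   OUTPUTFORMAT
     'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';

   -- What to run instead:
   create table my_orc_table (
      id     int
     ,name   string
   )
   STORED AS ORC;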

Moving Large Data Sets via SSH

Here is a simple way to move a large data set from a source system to HDFS without having to make an intermediate copy of the data. Of course, using Hadoop ecosystem tools like Flume or Sqoop would be preferred, but in this scenario my data nodes aren't able to see my source system due to network zoning issues, so my edge node is the only way of connecting back to my Hadoop cluster.

Here are the steps to complete this:

1) Establish passwordless SSH keys between the source and target systems. This is optional, but if you are automating the transfer it makes things much easier.

2) On your source system, run the following command (notice the "-" that tells hadoop fs -put to accept its input from the cat command):

   cat /dir/to/your/file.txt | ssh hadoop@yourhost.com "hadoop fs -put - /your/hdfs/location/file.txt"

3) To verify accuracy, run an md5sum on both your source and destination files. Hint: a simple way to checksum the HDFS copy is to pipe it straight into md5sum:

   hadoop fs -cat /your/hdfs/location/file.txt | md5sum

You can also use gzip or another form of compression on both ends of this stream, which can be helpful if network bandwidth is your biggest bottleneck (see the sketch below).
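Here's a minimal sketch of the compressed variant, assuming gzip/gunzip are available on both systems and using the same placeholder paths and hostname as above:

   # Compress on the source, stream over SSH, decompress on the edge node, land the plain file in HDFS
   gzip -c /dir/to/your/file.txt | ssh hadoop@yourhost.com "gunzip -c | hadoop fs -put - /your/hdfs/location/file.txt"

   # Or keep it compressed in HDFS (Hive and MapReduce can read .gz text files directly)
   gzip -c /dir/to/your/file.txt | ssh hadoop@yourhost.com "hadoop fs -put - /your/hdfs/location/file.txt.gz"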

Free Hadoop Training Resources

There is an incredible volume of free resources online to help you learn the Hadoop ecosystem. I've gone through these resources, webinars, and VM tutorials myself, and they will teach you the basics of the Hadoop landscape. A large portion of these resources are not vendor-specific: the fundamentals of the ecosystem (HDFS, MapReduce, Hive, Pig, Sqoop, Flume, etc.) are the same across all vendors. Each vendor then has its own twist on the administration toolset and has added its own extended capabilities.

Cloudera Essentials for Apache Hadoop

This is a 7-part recorded webinar series:
http://university.cloudera.com/onlineresources/clouderaessentials

Here is where you can download the Cloudera VM image:
http://www.cloudera.com/content/support/en/downloads.html

----------------------------------------------------------------------

MapR Academy

A vast library of webinars covering a multitude of subject areas, among them a good walk-through of setting up the "Word Count" MapReduce program in Java (which is the "Hello World" program for Hadoop):
http://www.mapr.com/academy/training-videos

Hadoop Basics
Cluster Management (this will be MapR-centric)
Hadoop Use Cases
Hadoop Admin
Hadoop Developer
Hadoop Business User

Here is where you can download the MapR VM image/sandbox:
http://www.mapr.com/products/hadoop-download

----------------------------------------------------------------------

For the Oracle / PL/SQL programmers who are new to Java, here are links to a couple of docs from Peter Koletzke @ Quovera which help bridge the gap. These were presented at ODTUG 2013 (as well as at RMOUG in 2012). They have nothing to do with Hadoop specifically, but they do help the non-Java programmer understand the Java programming structure. Note that not all Hadoop programming requires Java; there are several tools which are MapReduce generators, such as Hive and Pig. As the ecosystem tools continue to evolve, our need for...