Hive – “CREATE TABLE LIKE” drops some tblproperties

I ran across something today which I thought I should share regarding the "CREATE TABLE LIKE" statement. I found that not all of the tblproperties clauses of the originating table are copied to the new table definition. Below is a quick test. Note that the second table ends up missing the skip.header.line.count and skip.footer.line.count properties, but does keep the null format property. So, the moral of the story is to double-check yourself when using the "CREATE TABLE LIKE" statement.

drop table if exists test1;
create external table test1 (val string)
stored as textfile
location '/test1'
tblproperties ('skip.header.line.count'='1'
              ,'skip.footer.line.count'='1'
              ,'serialization.null.format'='');

drop table if exists test2;
create table test2 like jjohns008c.test1 location '/test2';

show create table test2;

CREATE TABLE `test2`(
  `val` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://test2'
TBLPROPERTIES (
  'serialization.null.format'='', ...
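If you need the dropped properties on the copy, one workaround (a sketch; Hive's ALTER TABLE ... SET TBLPROPERTIES supports this) is to re-apply them explicitly after the copy:

-- Re-apply the properties that "CREATE TABLE LIKE" dropped
alter table test2 set tblproperties
  ('skip.header.line.count'='1'
  ,'skip.footer.line.count'='1');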

Shell loop by day

Quite often I find that I need to create quick Unix scripts to run a set of commands day by day. Here's a template loop I start with:

# 1=start date YYYY-MM-DD
# 2=end date YYYY-MM-DD
CURR_DATE=${1}
while [ `date +"%Y%m%d" -d "${CURR_DATE}"` -le `date +"%Y%m%d" -d "${2}"` ]
do
  echo "Processing = ${CURR_DATE}"
  ## add your commands here...
  CURR_DATE=`date +"%Y-%m-%d" -d "${CURR_DATE} + 1 day"`
done
echo "ALL ...
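For example, if this template were saved as loop_by_day.sh (a made-up name) with only the echo as the command, a short run would print one line per day, inclusive of both endpoints:

$ sh loop_by_day.sh 2016-01-01 2016-01-03
Processing = 2016-01-01
Processing = 2016-01-02
Processing = 2016-01-03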

Hive – FitBit JSON example

Here's a quick mockup of working with the JSON data (with nested array elements).

1) Create an HDFS directory:

hadoop fs -mkdir -p /<hdfs_path>/fitbit

2) Copy your files to the local Unix server (into your home directory, perhaps): WinSCP / SCP / FTP / wget, for example.

3) Put your files into HDFS:

hadoop fs -put /<local_unix_path>/*.json /<hdfs_path>/fitbit

Then from the Hive CLI:

use <your database, or just use default>;

drop table if exists fitbit;
create external table fitbit (
  `activities-heart` array<struct<customHeartRateZones:string
                                 ,dateTime:string
                                 ,heartRateZones:array<struct<caloriesOut:double
                                                             ,max:int
                                                             ,min:int
                                                             ,minutes:int
                                                             ,name:string>>
                                 ,value:string>>
 ,`activities-heart-intraday` struct<dataset:array<struct<time:string
                                                         ,value:double>>
                                    ,datasetInterval:string
                                    ,datasetType:string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS textfile
LOCATION '/<hdfs_path>/fitbit'
TBLPROPERTIES ('serialization.null.format'='');

select * from fitbit;

I just made up a couple of examples of querying the data:

----------------------------------------------------------------------
-- How many calories did I burn each day?
----------------------------------------------------------------------
select datetime
      ,sum(calories)
from (select expl.datetime
            ,expl.heartratezones.caloriesout as calories_array
      from fitbit
      lateral view explode(`activities-heart`) tbl as expl) z
lateral view explode(calories_array) tbl as calories
group by datetime;

----------------------------------------------------------------------
-- How many times did my heart beat each day?
----------------------------------------------------------------------
select datetime
      ,sum(heartbeats)
from (select expl.datetime
            ,expl2.value as heartbeats
      from fitbit
      lateral view explode(`activities-heart`) tbl as expl
      lateral view explode(`activities-heart-intraday`.dataset) tbl2 as expl2) z
group by datetime;
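One note: the openx JsonSerDe does not ship with Hive, so you may need to register its jar in your session before creating the table. A sketch, with a placeholder path and jar name:

-- Path and jar name are placeholders; use wherever you put the
-- json-serde "with dependencies" build on your system.
add jar /<local_unix_path>/json-serde-with-dependencies.jar;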

Common Java Errors

Here's a common error which we've seen several times:

Failed with exception java.io.IOException:java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.orc.OrcStruct cannot be cast to org.apache.hadoop.io.BinaryComparable

There's an issue with doing a "show create table" and then executing the resulting "create table" statement if the table is ORC. When you do the "show create table", you get this:

STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'

But if you create the table with those clauses, you will then get the casting error when selecting: the generated DDL leaves out the ORC serde, so the new table falls back to the default LazySimpleSerDe, which expects text rows rather than ORC structs. To fix this, you need to change your "create table" statement to just this:

STORED AS ORC...
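As a quick illustration, here's a minimal sketch (the table name is made up):

-- Hypothetical table: STORED AS ORC sets the input format, the
-- output format, and the OrcSerde together, so selects work.
drop table if exists orc_test;
create table orc_test (val string) stored as orc;
select * from orc_test;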

Free Hadoop Training Resources

There is an incredible volume of free resources online to help learn the Hadoop ecosystem. I've gone through these resources, Webinars, and VM tutorials myself, and they will teach you the basics of the Hadoop landscape. A large portion of these resources are not vendor-specific. The fundamentals of the ecosystem (HDFS, MapReduce, Hive, Pig, Sqoop, Flume, etc.) are the same across all vendors. Each vendor then has its own twist on the administration toolset, and has added its own extended capabilities.

Cloudera Essentials for Apache Hadoop
This is a 7-part recorded Webinar series:
http://university.cloudera.com/onlineresources/clouderaessentials

Here is where you can download the Cloudera VM image:
http://www.cloudera.com/content/support/en/downloads.html

---------------------------------------------------------------------

MapR Academy
A vast library of Webinars covering a multitude of subject areas. Among these is a good walk-through of setting up the "Word Count" MapReduce program in Java (which is the "Hello World" program for Hadoop):
http://www.mapr.com/academy/training-videos

- Hadoop Basics
- Cluster Management (this will be MapR-centric)
- Hadoop Use Cases
- Hadoop Admin
- Hadoop Developer
- Hadoop Business User

Here is where you can download the MapR VM image/sandbox:
http://www.mapr.com/products/hadoop-download

---------------------------------------------------------------------

For the Oracle / PL/SQL programmers who are new to Java, here are links to a couple of docs from Peter Koletzke @ Quovera which help bridge the gap. These were presented at ODTUG 2013 (as well as at RMOUG in 2012). They really have nothing to do with Hadoop specifically, but they do help the non-Java programmer understand the Java programming structure. Note that not all Hadoop programming requires Java. There are several tools which are MapReduce generators, such as Hive and Pig. As the ecosystem tools continue to evolve, our need for...