Thursday, June 9, 2011

Extract date in required formats from hive tables


Problem: My hive table has date in the format ‘yyyy-MM-dd HH:mm:ss’ which was obtained from  a database table through SQOOP. I need it in my query of the format ‘yyyy-MM-dd ’.

Solution: Use a regex expression to extract the required value.
When we sqoop in the date value to hive from rdbms, the data type hive uses to store that date is String. You can use a substring functions to achieve the same, the most easiest way would be to use the regexp_extract() function in hive.

To get the date alone in ‘yyyy-MM-dd ’ format we can go in for
regexp_extract(column_datetime,'(.*\-.*\-.*)\\s(.*)',1)
You can use it as Select *, regexp_extract(column_datetime,'(.*\-.*\-.*)\\s(.*)',1) from sample_table;

**column_datetime is a column of type string that stores date in the format ‘yyyy-MM-dd HH:mm:ss’ which is part of hive table sample_table

Some possible cases
Case: 01
If you want the timestamp alone regexp_extract(column_datetime,'(.*)\\s(.*)',2)

Case: 02
If you want the year and month alone of the format ‘yyyy-MM’ then use regexp_extract(column_datetime,'(.*\-.*)\-.*',1)

Case: 03
All these were mere String operations done using regular expressions now if I want my date in the format ‘dd-MM- yyyy’ it’d be better to go in for a combination date functions and string concatenation functions. We can achieve our goal using this combination of functions

concat(year(column_datetime),'-',month(column_datetime),'-',day(column_datetime) )

So we can use the same in our query  like
Select *, concat(day(column_datetime),'-',month(column_datetime),'-', year(column_datetime))
 from sample_table;

But there would be issues in using this that if dates and month values are less than 10 it would be represented with a single character, we can overcome the same with the use of case functionality as

concat(CASE WHEN day(install_datetime) < 10 THEN concat('0',day(install_datetime)) ELSE trim(day(install_datetime)) END,'-',CASE WHEN month(install_datetime) < 10 THEN concat('0',month(install_datetime)) ELSE trim(month(install_datetime)) END,'-',year(install_datetime))

A sample usage in hive QL would be as
select *,concat(CASE WHEN day(install_datetime) < 10 THEN concat('0',day(install_datetime)) ELSE trim(day(install_datetime)) END,'-',CASE WHEN month(install_datetime) < 10 THEN concat('0',month(install_datetime)) ELSE trim(month(install_datetime)) END,'-',year(install_datetime)) from sample_table;

7 comments:

  1. Hi, the above info is very useful.
    I have this date format in string format
    (applied concat(year(column_datetime),'-',month(column_datetime),'-',day(column_datetime) ) ) function.

    I want to perform query, like group by month:
    Please suggest.

    2009-12-02
    2009-12-01
    2009-11-03
    2009-11-02
    2009-10-30
    2009-10-29

    ReplyDelete
  2. Hi Yogesh

    If you have the date in the specified format, ie like '2009-12-02' , you can do a group by on month as follows

    Select month(Date),count(*) FROM table_name GROUP BY month(Date)


    ReplyDelete
  3. what datatype could be used for date in this format 12/01/2012 ?
    any help

    ReplyDelete
  4. hi jose,

    Hive date data type does not exist, so for your date format you can use string as a data type.

    ReplyDelete
  5. Dude.. This is too complex. If all you want is the date'DD-MM-YYYY'. Just declare the data type as string and use a substring with ltrim or rtrim according to your requirement.

    ReplyDelete