Wednesday, June 1, 2011

Implementing basic SQL Update statement in Hive

Hive is not meant for point to point queries and hence sql update functionality would be least required in hive that should be the reason hive doesn’t have update functionality for rows or rather individual columns in a row. There would be cases you find a much more suitable use case in hive, but the same can’t be implemented as it includes an update statement on a few rows specified by a condition. We can implement the basic sql row update in hive by following a few series of steps as follows

1.       Load all the rows that has to be updated into a hdfs dir
a.       Here in the select statement that fetches the rows that satisfy the condition, we specify the column names one by one rather than a Select *
b.      In place of the colum whose value is to be updated we include the new value hard coded
2.       Remove these rows loaded into hdfs dir in step 1 from table
a.       Do an insert overwrite on the same table to include only those record that do not satisfies the condition
3.       Load the updated rows from hdfs dir in Step into into table

Let’s look at the same through an example
                We have a hive table employee with the following columns employeeId, employeeName, experienceMonths, salary and visaEligibility. We need to do an operation equivalent to the SQL query below in hive.

SQL Query
update table employee set visaEligibility = 'YES' where experienceMonths >36 ;

Equivalent Hive QL
#load the rows that fall in the condition into an hdfs dir with the corresponding column value modified
INSERT OVERWRITE DIRECTORY '/userdata/bejoy/employeetemp' SELECT employeeId,employeeName, experienceMonths ,salary,'YES' FROM employee WHERE  experienceMonths > =36;

#removes the records from table that has been loaded into hdfs dir
#use the negate condition to the one provided in previous step
INSERT OVERWRITE TABLE employee SELECT * FROM employee WHERE experienceMonths < 36;

#load the rows from hdfs dir on to table on top of its current contents
LOAD DATA INPATH '/userdata/bejoy/employeetemp' INTO TABLE employee;

A better Approach
                When I understood some more functionalities of hive, it was evident to me that there is a better way to solve this out using CASE statement. The same result could be obtained in a single hive query as stated below

INSERT OVERWRITE TABLE employee SELECT employeeId,employeeName, experienceMonths ,salary, CASE WHEN experienceMonths >=36 THEN ‘YES’ ELSE visaEligibility END AS visaEligibility FROM employee;

The approaches and solutions described in this post are for non partitioned tables. Ideally in real rime scenarios the data volume would be too large to handle in a single partition so we have to go in for multiple partitions. But the same approaches could be used out there as well.

8 comments:

  1. Is there any way to do updates from another table
    eg:
    UPDATE A
    FROM TEMP A
    ,TEMP01 B
    SET GUART = B.GUART
    WHERE A.NO = B.ANO;

    thanks

    ReplyDelete
  2. Hi,

    It is possible but the operation is expensive as you are kind of overwriting the whole underlying data set.

    Do a join between the distinct data sets(tables) and insert overwrite into the required table like mentioned in the blog post.

    The join also can turn a little expensive here, as if you use a ON clause then the records will be filtered,ie only the matching records will be in output. So you may have yo go in for a full Cartesian join.

    I'm not a SQL expert, just putting in my 2 cents for you to take it ahead.

    ReplyDelete
  3. Hi Bejoy,

    I am using hadoop database. I need to update the table in hadoop database using SQL commands in hive..

    something like : hive -e "SQL Command". Can you give me an example for that.

    ReplyDelete
  4. Hi

    I assume You are using hive. You can fire hive queries as
    1) enter into hive CLI and give your query
    2) if query is on a file execute as hive -f fileName

    If from a java application use hive jdbc connector

    ReplyDelete
  5. This comment has been removed by the author.

    ReplyDelete
  6. hi everyone the query mentioned above for updation
    INSERT OVERWRITE TABLE employee SELECT employeeId,employeeName, experienceMonths ,salary, CASE WHEN experienceMonths >=36 THEN ‘YES’ ELSE visaEligibility END AS visaEligibility FROM employee;

    is working for one column that is you can say if record have only one column id and name
    but when more than two column it creates problem like column format changed updation in name appeared in marks etc everytime
    dnt know what to do..

    ReplyDelete
  7. i have created table record
    as
    201,ashu,sharma,500
    202,channi,singh,600
    203,mukesh,gupta,700
    204,arun,mehta,800
    when i used
    INSERT OVERWRITE TABLE record PARTITION(dated='31072013')
    SELECT id,fname,marks, CASE WHEN id=202 THEN 'yahoo' ELSE lname END AS lname
    FROM record;
    then it will return

    201 sharma ashu 500 31072013
    202 singh facebook yahoo 31072013
    203 gupta mukesh 700 31072013
    204 mehta arun 800 31072013
    lastname yahoo has appeared in marks column
    this error again and again dnt know wht to do
    any help would be appreciated.

    ReplyDelete
  8. I created 2 tables emp (name,id) and dept (name ,id) with following data,

    (emp1,1),(emp2,2),(emp3,3)
    AND
    (dept1,1),(dept2,2),(dept3,3),(dept4,4),(dept5,5),(dept6,6).

    I wish to update emp name where emp.id=dept.id and update it with dept name

    I wrote following query:

    INSERT OVERWRITE TABLE emp
    SELECT e.id,
    CASE WHEN e.id = d.id THEN d.name ELSE e.name END AS name FROM dept AS d,emp as e;

    it is actually inserting 18 rows in the emp table instead of 3.
    Please help!

    ReplyDelete