Kick Start Hadoop: Implementing basic SQL Update statement in Hive

Wednesday, June 1, 2011

Implementing basic SQL Update statement in Hive

Hive is not meant for point to point queries and hence sql update functionality would be least required in hive that should be the reason hive doesn’t have update functionality for rows or rather individual columns in a row. There would be cases you find a much more suitable use case in hive, but the same can’t be implemented as it includes an update statement on a few rows specified by a condition. We can implement the basic sql row update in hive by following a few series of steps as follows

1. Load all the rows that has to be updated into a hdfs dir

a. Here in the select statement that fetches the rows that satisfy the condition, we specify the column names one by one rather than a Select *

b. In place of the colum whose value is to be updated we include the new value hard coded

2. Remove these rows loaded into hdfs dir in step 1 from table

a. Do an insert overwrite on the same table to include only those record that do not satisfies the condition

3. Load the updated rows from hdfs dir in Step into into table

Let’s look at the same through an example

We have a hive table employee with the following columns employeeId, employeeName, experienceMonths, salary and visaEligibility. We need to do an operation equivalent to the SQL query below in hive.

SQL Query

update table employee set visaEligibility = 'YES' where experienceMonths >36 ;

Equivalent Hive QL

#load the rows that fall in the condition into an hdfs dir with the corresponding column value modified

INSERT OVERWRITE DIRECTORY '/userdata/bejoy/employeetemp' SELECT employeeId,employeeName, experienceMonths ,salary,'YES' FROM employee WHERE experienceMonths > =36;

#removes the records from table that has been loaded into hdfs dir

#use the negate condition to the one provided in previous step

INSERT OVERWRITE TABLE employee SELECT * FROM employee WHERE experienceMonths < 36;

#load the rows from hdfs dir on to table on top of its current contents

LOAD DATA INPATH '/userdata/bejoy/employeetemp' INTO TABLE employee;

A better Approach

When I understood some more functionalities of hive, it was evident to me that there is a better way to solve this out using CASE statement. The same result could be obtained in a single hive query as stated below

INSERT OVERWRITE TABLE employee SELECT employeeId,employeeName, experienceMonths ,salary, CASE WHEN experienceMonths >=36 THEN ‘YES’ ELSE visaEligibility END AS visaEligibility FROM employee;

The approaches and solutions described in this post are for non partitioned tables. Ideally in real rime scenarios the data volume would be too large to handle in a single partition so we have to go in for multiple partitions. But the same approaches could be used out there as well.

8 comments:

avinashyeluriJune 20, 2012 at 7:40 AM
Is there any way to do updates from another table
eg:
UPDATE A
FROM TEMP A
,TEMP01 B
SET GUART = B.GUART
WHERE A.NO = B.ANO;

thanks
ReplyDelete
Replies
Bejoy KSJune 20, 2012 at 8:10 AM
Hi,

It is possible but the operation is expensive as you are kind of overwriting the whole underlying data set.

Do a join between the distinct data sets(tables) and insert overwrite into the required table like mentioned in the blog post.

The join also can turn a little expensive here, as if you use a ON clause then the records will be filtered,ie only the matching records will be in output. So you may have yo go in for a full Cartesian join.

I'm not a SQL expert, just putting in my 2 cents for you to take it ahead.
ReplyDelete
Replies
Freaky-BiologistJanuary 22, 2013 at 9:50 PM
Hi Bejoy,

I am using hadoop database. I need to update the table in hadoop database using SQL commands in hive..

something like : hive -e "SQL Command". Can you give me an example for that.
ReplyDelete
Replies
Bejoy KSMarch 25, 2013 at 6:09 AM
Hi

I assume You are using hive. You can fire hive queries as
1) enter into hive CLI and give your query
2) if query is on a file execute as hive -f fileName

If from a java application use hive jdbc connector
ReplyDelete
Replies
UnknownApril 1, 2013 at 7:41 PM
This comment has been removed by the author.
ReplyDelete
Replies
UnknownJuly 31, 2013 at 6:22 AM
hi everyone the query mentioned above for updation
INSERT OVERWRITE TABLE employee SELECT employeeId,employeeName, experienceMonths ,salary, CASE WHEN experienceMonths >=36 THEN ‘YES’ ELSE visaEligibility END AS visaEligibility FROM employee;

is working for one column that is you can say if record have only one column id and name
but when more than two column it creates problem like column format changed updation in name appeared in marks etc everytime
dnt know what to do..
ReplyDelete
Replies
UnknownJuly 31, 2013 at 6:31 AM
i have created table record
as
201,ashu,sharma,500
202,channi,singh,600
203,mukesh,gupta,700
204,arun,mehta,800
when i used
INSERT OVERWRITE TABLE record PARTITION(dated='31072013')
SELECT id,fname,marks, CASE WHEN id=202 THEN 'yahoo' ELSE lname END AS lname
FROM record;
then it will return

201 sharma ashu 500 31072013
202 singh facebook yahoo 31072013
203 gupta mukesh 700 31072013
204 mehta arun 800 31072013
lastname yahoo has appeared in marks column
this error again and again dnt know wht to do
any help would be appreciated.
ReplyDelete
Replies
UnknownMarch 10, 2015 at 4:51 AM
I created 2 tables emp (name,id) and dept (name ,id) with following data,

(emp1,1),(emp2,2),(emp3,3)
AND
(dept1,1),(dept2,2),(dept3,3),(dept4,4),(dept5,5),(dept6,6).

I wish to update emp name where emp.id=dept.id and update it with dept name

I wrote following query:

INSERT OVERWRITE TABLE emp
SELECT e.id,
CASE WHEN e.id = d.id THEN d.name ELSE e.name END AS name FROM dept AS d,emp as e;

it is actually inserting 18 rows in the emp table instead of 3.
Please help!
ReplyDelete
Replies