Comments on Kick Start Hadoop: Enable Sorted Bucketing in Hive

Bejoy, Was really useful information. I have a qu...

2015-02-16T01:31:30.614-08:00

Bejoy,

This was really useful information. I have a question, which I am posing as a use case.

Say there is a set of ids that come from an external source, and the data in HDFS is bucketed (hashed) by id and sorted by id. The records corresponding to the external ids need to be deleted.

1. Can I delete the records from a specific bucket?
2. How can I make this process optimal?
3. If deletions can take place at the bucket level, what happens if a few records are deleted? Does it create a new bucket with the remaining data, or are the deleted rows marked as null?
4. What happens if all the rows in a bucket are deleted?

The following is directly from the hive wiki: >...

2013-06-16T10:10:23.978-07:00

The following is directly from the hive wiki:

>> Cluster By is a short-cut for both Distribute By and Sort By. https://cwiki.apache.org/Hive/languagemanual-sortby.html
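A minimal sketch of that equivalence at the query level, assuming a hypothetical table employees(id, name) that does not appear in the original post:

```sql
-- Hypothetical table: employees(id INT, name STRING)

-- CLUSTER BY sends rows with the same id to the same reducer
-- and sorts the rows within each reducer by id.
SELECT id, name
FROM employees
CLUSTER BY id;

-- The same thing written out in long form.
SELECT id, name
FROM employees
DISTRIBUTE BY id
SORT BY id;
```

Note that this is the SELECT-level clause. The CLUSTERED BY clause used in CREATE TABLE only names the bucketing columns, which is why a table definition can carry a separate SORTED BY.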

Hi Bejoy, would you be willing to explain why the ...

2013-06-16T09:44:34.539-07:00

Hi Bejoy, would you be willing to explain why "cluster by", which is a synonym for (distribute by && sort by), would then require an (apparently redundant) "sort by" as well?

Hi, I am not able to execute any of the above queries ...

2013-03-26T07:44:56.570-07:00

Hi, I am not able to execute any of the above queries, even with data loaded.

To know the partitioning and bucketing information...

2013-03-25T05:58:52.641-07:00

To find the partitioning and bucketing information for an existing table, use
DESCRIBE FORMATTED or DESCRIBE EXTENDED on that table.
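For illustration, a short sketch of those commands, assuming a hypothetical table named page_views (the name is not from the original post):

```sql
-- Hypothetical table name: page_views

-- Human-readable layout; the output includes the partition columns
-- plus the bucket count, bucket columns and sort columns.
DESCRIBE FORMATTED page_views;

-- Same metadata in a more compact, less readable form.
DESCRIBE EXTENDED page_views;

-- Lists the partitions that currently exist for the table.
SHOW PARTITIONS page_views;
```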

How would we know how many partitions and buckets ...

2013-02-05T23:50:38.171-08:00

How would we know how many partitions and buckets a table has?

Is there any query that shows this before we use (bucket 3 out of n)?
Could anybody please help me with this?

Hi Jeet 10 tables on a join has limited scope of ...

2012-09-10T10:18:50.870-07:00

Hi Jeet

A join across 10 tables has limited scope for optimization. If you can break the query down, you can apply the various optimization techniques to each part individually. I can comment in more detail only if I know your use case, your queries, the data volume of each table and your cluster statistics.

What is Bucketed Map Join?
If two tables are bucketed on the same key/keys on which the join is done, and one of the tables is medium sized, then you can benefit from a bucketed map join. When an input split from the large table is processed by a mapper, the corresponding bucket from the smaller table can be loaded in memory to perform a map-side join, thereby eliminating the reduce phase. This speeds up the join operation to a great extent.
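A minimal sketch of such a join, under stated assumptions: the tables orders and customers, their columns and the bucket counts are hypothetical, and both tables are assumed to have been populated with bucketing enforced. Hive's documentation additionally asks that the bucket count of one table be a multiple of the other's. Depending on the Hive version, the MAPJOIN hint may be ignored in favor of automatic conversion.

```sql
-- Hypothetical tables, both bucketed on the join key.
CREATE TABLE orders (customer_id INT, amount DOUBLE)
CLUSTERED BY (customer_id) INTO 32 BUCKETS;

CREATE TABLE customers (customer_id INT, name STRING)
CLUSTERED BY (customer_id) INTO 8 BUCKETS;

-- Ask Hive to use bucket-level map joins where possible.
SET hive.optimize.bucketmapjoin = true;

-- The hint names the smaller table (alias c), whose matching
-- bucket is loaded in memory by each mapper.
SELECT /*+ MAPJOIN(c) */ o.customer_id, c.name, o.amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;
```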

Can you explain how bucketing can optimize a join? ...

2012-08-25T11:52:25.759-07:00

Can you explain how bucketing can optimize a join?
Ex: I have 10 tables in Hive.
I want to create a new table that is the join result of all 10 tables.
Another question: can I join the 10 tables in a single query and store the output in one table? And what should I consider to optimize the join query in Hive (especially when it is a 10+ table join)? I need serious help on this.
Please explain it with the help of some code.

Hi Tiru, Bucketing is basically used for sampling a...

2012-06-03T10:32:14.495-07:00

Hi Tiru,

Bucketing is basically used for sampling, not for queries that target a specific group of data. Say you want to calculate the average of age (one column in a table) and you are not looking for the exact average but an approximate one. In such cases bucketing is used. Bucketing is also used to optimize joins.

Also, when I mentioned '2 out of n' buckets, it means choose bucket number 2 out of n buckets. If I mention '1 out of 1', then the whole data would be used as input.
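A minimal sketch of the approximate-average case, assuming a hypothetical table users(id, age) bucketed on id into 4 buckets. Going by the Hive language manual, BUCKET x OUT OF y returns the rows that fall in bucket x when the data is divided into y buckets.

```sql
-- Hypothetical table: users(id INT, age INT),
-- created with CLUSTERED BY (id) INTO 4 BUCKETS.

-- Approximate average: only the rows in bucket 2 of 4 are read.
SELECT AVG(age)
FROM users TABLESAMPLE(BUCKET 2 OUT OF 4 ON id);

-- Exact average for comparison: the whole table is read.
SELECT AVG(age)
FROM users;
```

Because the sampling column matches the bucketing column, Hive can read just the matching bucket file instead of scanning the full table.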

Regards
Bejoy

Hi, very good post. I have one question. How do we...

2012-05-29T01:57:36.323-07:00

Hi,
Very good post.
I have one question.
How do we know that the required data resides in bucket 2?

thanks
tiru