January23
It is the last seminar in this semester. We present four papers: 2 system paper and 2 algorithm paper.
The note is in messy and I just leave it here to make archive.
Efficient Outer Join Data Skew Handling in Parrell Database
VLDB 2009
Handling Data Skew in Parallel Joins in Shared-Nothing Systems
SIGMOD 2008
Yu Xu @ Teradata et al.
Xixian HAN
GAMPS: Compressing Multi Sensor Data by Grouping and Amplitude Scaling
Sorabh Gandhi (University of California, Santa Barbara), Suman Nath (Microsoft Research), Subhash Suri (University of California, Santa Barbara), Jie Liu (Microsoft Research)
SIGMOD 2009
Yu LIU
Energy Efficient TDMA Sleep Scheduling in Wireless Sensor Networks
Junchao Ma (The Hong Kong Polytechnic University, HK), Wei Lou (The Hong Kong Polytechnic University, HK), Xiang-Yang Li (Illinois Institute of Technology, US)
INFOCOMM 2009
Xiaolin FANG
Correlation Maps: A Compressed Access Method for Exploiting Soft Functional Dependencies
Hideaki Kimura (Brown Univ.), George Huo (Google, Inc), Alexander Rasin (Brown Univ.), Samuel Madden (Massachusetts Institute of Technology), Stan Zdonik (Brown Univ.)
VLDB 2009 Slide
Xianming Liu
Handling Data Skew in Parallel Joins in Shared-Nothing Systems SIGMOD 2008,Yu Xu @ Teradata et al. This is a paper in the Industrial Session. It deals with the join operation in parallel database system
Table:
Two relation R(x,a) and S(y,b) distribute in several Parrellel Units.
Horizontally Partition the Table in R.x and R.y.
Operation:
Join[R,S](R.a,S.b)
Two solutions:
- Redistribution Plan
- Stage 1. Redistribute the relation based on the join attribute with hash function. When the attribute is luckily the partition attribute, the stage is not needed.
- Stage 2. Join in each PU. Advantage: Nearly linear speed-up on even balancing conditions. Disadvantage: In the case of data skew, where many row hash into the same machine and the machine becomes the hotspot, add more nodes wouldn’t help.
- Duplication Plan
- Stage 1. Broadcast the smallest relation to other PUs.
- Stage 2. Join in each PU. Advantage: Works when one join relation is very small Disadvantage: The work situation is rare to occur.
The idea is very simple: combine the two strategy. Assume we’ve known that L1 = {skew values of R.a}, L2 = {skew values of S.b}
When partition the R relation
- a is in L1, Stay.
- a is in L2, Partial Duplication.
- otherwise, Partial Redistribution.
Special Case: when a is in both L1 and L2, the system compare two list’s counts of a and only include a in the list with larger counts of a. Thus ensure L1 and L2 is disjoint finally.
The point is that we will not send all data into one machine and make it the bottleneck.
GAMPS: Compressing Multi Sensor Data by Grouping and Amplitude Scaling, SIGMOD 08, UCSB and Microsoft
It is said the first paper try to multi sensor data.
Input:<br />K sensors, each one is a time series S_i={V_i(t),t = [1...n]}<br />template L = {L_1,L_2, ..., L_l}<br /><br />Cost Model: memory requirement SUM (m_i + d_i)<br /><br />Error bound: L-infinite error<br />
ISA: Interval Sharing approximation. Partition the series and use one value to represent a interval.
GAMP: Several interval maybe in different scale, scale it with linear equation.
Energy Efficient TDMA Sleep Scheduling in Wireless Sensor Networks, Infocom 2009
The motivation is that sensor need to consume powers and time when get start up and close. We can design some schedule to make this process as less as possible.
Omit. I’m not interested in it.
Correlation Maps: A Compressed Access Method for Exploiting Soft Functional Dependencies, VLDB 2009, Brown, Google, MIT.
- Correlations abound. There exist soft dependence between attributed, i.e. we want to find the city “Boston” then the state is “MA” with large possibility.
- Secondary index maybe useless because they are not clustered.
Core idea:
When making range queries on attribute which is not clustered, we can utilize correlation clustered attribute to optimize the query. This paper designs a correlation map structure to capture it.
Cost Model:
The core contribution of the paper is the cost model to make decision when to make the optimization
Challenge:
Challenge is how to update the structure.