4 R's of DB Research

Reading, Rithmetic, Research and wRiting

{Seminar}2010.05.13.

May13

There are 3 paper in our seminar this moring. I’ll outline the main problem of each paper.

  • Modeling and querying possible repairs in Duplicate Detection

    George Beskales, Mohamed A. Soliman, Ihab F. Ilyas, Shai Ben-David

    VLDB 2009 Slide

    It shows a novel techniques to show the result of duplicate record with uncertain model. Possoible World of records is generated by hierarchy cluster algorithm with threshoudl limit. The paper also show the query processing in the Uncerain clean relationhip.

  • Reverse Top-k Queries

    Akrivi Vlachou, Christos Doulkeridis, Yannis Kotidis and Kjetil Nørvåg

    ICDE 2010

    Problem: Given top-k point, how to find the class of score function which result to the top-k point? The score function isthe linear combination of all attributes and the weight must be positive.

  • Minimizing the Cost of Mine Selection Via Sensor Networks

    C. Liu and G. Cao

    INFOCOM 2009

    Problem: Given $n$ mines, $m$ targets, how to select a mininum-cost subset of mines to destory each target $j$ with a probablity larger than $\sigma_j$?

    Solution: Greedy algorithms and layering algorithm

Powered by ScribeFire.

posted under Seminar | No Comments »

{Seminar}2010.01.23. – The Last Seminar in the Semester

January23

It is the last seminar in this semester. We present four papers: 2 system paper and 2 algorithm paper.

The note is in messy and I just leave it here to make archive.

Efficient Outer Join Data Skew Handling in Parrell Database

VLDB 2009

Handling Data Skew in Parallel Joins in Shared-Nothing Systems

SIGMOD 2008

Yu Xu @ Teradata et al.

Xixian HAN

GAMPS: Compressing Multi Sensor Data by Grouping and Amplitude Scaling

Sorabh Gandhi (University of California, Santa Barbara), Suman Nath (Microsoft Research), Subhash Suri (University of California, Santa Barbara), Jie Liu (Microsoft Research)

SIGMOD 2009

Yu LIU

Energy Efficient TDMA Sleep Scheduling in Wireless Sensor Networks

Junchao Ma (The Hong Kong Polytechnic University, HK), Wei Lou (The Hong Kong Polytechnic University, HK), Xiang-Yang Li (Illinois Institute of Technology, US)

INFOCOMM 2009

Xiaolin FANG

Correlation Maps: A Compressed Access Method for Exploiting Soft Functional Dependencies

Hideaki Kimura (Brown Univ.), George Huo (Google, Inc), Alexander Rasin (Brown Univ.), Samuel Madden (Massachusetts Institute of Technology), Stan Zdonik (Brown Univ.)

VLDB 2009 Slide

Xianming Liu

Handling Data Skew in Parallel Joins in Shared-Nothing Systems SIGMOD 2008,Yu Xu @ Teradata et al. This is a paper in the Industrial Session. It deals with the join operation in parallel database system

Table:
Two relation R(x,a) and S(y,b) distribute in several Parrellel Units.
Horizontally Partition the Table in R.x and R.y.

Operation:
Join[R,S](R.a,S.b)

Two solutions:

  1. Redistribution Plan
    • Stage 1. Redistribute the relation based on the join attribute with hash function. When the attribute is luckily the partition attribute, the stage is not needed.
    • Stage 2. Join in each PU. Advantage: Nearly linear speed-up on even balancing conditions. Disadvantage: In the case of data skew, where many row hash into the same machine and the machine becomes the hotspot, add more nodes wouldn’t help.
  2. Duplication Plan
    • Stage 1. Broadcast the smallest relation to other PUs.
    • Stage 2. Join in each PU. Advantage: Works when one join relation is very small Disadvantage: The work situation is rare to occur.

The idea is very simple: combine the two strategy. Assume we’ve known that L1 = {skew values of R.a}, L2 = {skew values of S.b}

When partition the R relation

  1. a is in L1, Stay.
  2. a is in L2, Partial Duplication.
  3. otherwise, Partial Redistribution.

Special Case: when a is in both L1 and L2, the system compare two list’s counts of a and only include a in the list with larger counts of a. Thus ensure L1 and L2 is disjoint finally.

The point is that we will not send all data into one machine and make it the bottleneck.

GAMPS: Compressing Multi Sensor Data by Grouping and Amplitude Scaling, SIGMOD 08, UCSB and Microsoft

It is said the first paper try to multi sensor data.

Input:<br />K sensors, each one is a time series S_i={V_i(t),t = [1...n]}<br />template L = {L_1,L_2, ..., L_l}<br /><br />Cost Model: memory requirement SUM (m_i + d_i)<br /><br />Error bound: L-infinite error<br />

ISA: Interval Sharing approximation. Partition the series and use one value to represent a interval.

GAMP: Several interval maybe in different scale, scale it with linear equation.

Energy Efficient TDMA Sleep Scheduling in Wireless Sensor Networks, Infocom 2009

The motivation is that sensor need to consume powers and time when get start up and close. We can design some schedule to make this process as less as possible.

Omit. I’m not interested in it.

Correlation Maps: A Compressed Access Method for Exploiting Soft Functional Dependencies, VLDB 2009, Brown, Google, MIT.

  • Correlations abound. There exist soft dependence between attributed, i.e. we want to find the city “Boston” then the state is “MA” with large possibility.
  • Secondary index maybe useless because they are not clustered.

Core idea:

When making range queries on attribute which is not clustered, we can utilize correlation clustered attribute to optimize the query. This paper designs a correlation map structure to capture it.

Cost Model:

The core contribution of the paper is the cost model to make decision when to make the optimization

Challenge:

Challenge is how to update the structure.

posted under Seminar | No Comments »

{Seminar}2009.12.12. – Note on DKERC Semniar [0]

December12

We discuss four papers in this morning. See DKERC website for detail.
This is brief note on today DKERC Seminar. Because of time limit, the note maybe seem superficial , I will review them later.

I find it is hard to paraphrase some definition in original paper, thus I will just quote them sometimes :-)
The following is some nots about these papers,
{Seminar}2009.12.12. – Light-weight, Runtime Verification of Query Sources.[1]
{Seminar}2009.12.12. – Indexing Boolean Expressions.[2]
{Seminar}2009.12.12. – Distance-based Representative Skyline.[3]
{Seminar}2009.12.12. – Bursty Traffic over Bursty Links.[4]

posted under Seminar | No Comments »

{Seminar}2009.12.12. – Light-weight, Runtime Verification of Query Sources.[1]

December12

Light-weight, Runtime Verification of Query Sources.
Tingjian Ge (Brown U), Stan Zdonik (Brown U).
ICDE 2009
Lei YU

Background
Data owner outsources their data to the Network Storage System such as AWS.
The storage provider should ensure the underlying data not been modified by attackers
This paper address arbitrary modification of data incluing changes that may pass the DB.
In a world, it deals with modification in file system.

Q: What about Secure File system? What is the difference?
A: They are general purpose, not full using DB semantics? (No time to review, see paper)

“The basic idea is that we build one or more Merkle hash tree(s) for each sensitive DB object (e.g. specifica attributes and their associate index).
Then at query execution time, as we read data from the disk,
we verify that a minimum subset of “referenced” bytes from the sensitive DB objects is authentic,
thus providing that the data source and query result and both correct and complete
relateive to the initial “good state” from which the Merkle trees were built.”

The Merkle Hash Tree is a tree structure originally proposed for efficient authentication of equality queries in a DB sorted on the query attribute.
Every DB record corresponds to a distinct leaf node.
Each leaf node stores a hash value for the binary representation of its record.
The tree is constructed bottom-up, with each internal node storing the hash value of the concatenation of the hash value of its children.
With the root of the hash tree *and* the encryption key, we compute the signature of sensitive data.
*Query*
As a query runs, we need to verify it with Hash Tree by computing the signature again.
The server verifies the authenticity of the data it used from disk by computing hash values in a bottom-up manner, all the way up to the root.
Then we compared the signature with original signature to verified the authenticity.

*Update*
If we insert a data, we need to update the corresponding hash value in the tree.
The problem is if only part of the data modified, is it failure for all records?

*Organization*
Step 1. We split each page of a sensitive DB objects into several sub-blocks and construct a small in-page hash tree.
Step 2. All these in-pages hash trees make up a big hash tree.

SHA-1 is faster when the input is bigger message blocks

Q: Why we have to split a page?
A: Choose a fanout such that each hash operation is applied to a number of bytes that is close to the sub-block size

Q: How to split a page (i.e. what’s sub-block size of a page)?
A: In the experiment, try 3 size (1024, 512 and 256 bytes)
Tradeoff between time and space.
Small Size: quicker but bigger
Big Size: slower but smaller

Q: OK, then how to organize the in-page hash page ? (i.e. #fanout=?)
A: each hash operation applied to a number of bytes that is close to the sub-block size.

*Concurrency Control*
Two basic problems in CC, sync and reader/writer, merge some update operation to improver performance.
4 Phases in Writer
Synchronization Primitives and Reader/Writer
Write Transaction
Verification
Mark
Update
Helping

Q: Does the tree need to be balanced?

娱乐:
八卦一下 Merkle, 似乎很多数据结构都是以这个作为开头的
钻石时代

posted under Seminar | No Comments »

{Seminar}2009.12.12. – Indexing Boolean Expressions.[2]

December12

Indexing Boolean Expressions
Steven Whang (Stanford Univ.), Chad Brower (Yahoo! Research), Jayavel Shanmugasundaram (Yahoo! Research), Serguei Vassilvitskii (Yahoo! Research), Erik Vee (Yahoo! Research), Ramana Yerneni (Yahoo! Research), Hector Garcia-Molina (Stanford Univ.)
VLDB 2009 [slides]
Xianming LIU

The problem is simple but practical. I like this type of problems.

[Problem]
Input: Given a assignment, a large set of Boolean Expressions(BE)
Output: All satisified BEs  under the assignment.

[Application]
Online Advertisig:
A BE represents an advertiser’s user targeting requirements.
An assignment of value to attributes represents the characteristics of a user visiting an online page.
Publish/Subscribe System:
A BE represents a subscription.
An assignment of value to attributes represents an event.

[Solution]
Build a inverted index [Atomic Expression, BE position list] to index all expressions.
- Conjunction: Merge sorted list
- DNF: find the
- CNF:
What is the key difference between the IR joining algorithms

posted under Seminar | No Comments »

{Seminar}2009.12.12. – Distance-based Representative Skyline.[3]

December12

Distance-based Representative Skyline.
Yufei Tao (CUHK), Ling Ding (CUHK), Xuemin Lin (UNSW), Jian Pei (SFU)
ICDE 2009
Guohua LI

This paper maybe the hardest one in today’s seminar.
[Background]
Review of Skyline
Given a set D of multidimensional points skyline consists of the points that are not dominated by any other points.
X Dominates Y means X is smaller than Y in all dimensions.
In another word, skyline point implys that there at least one dimension of this point is smaller than others while other dimensions not.

What’s the k-representive skyline?
In this paper, it is defined by Representation Error — the maximum value of non-representive skyline point to its nearest skyline point.
Given Skyline S={s_1,s_2,…,s_n}
In 2-D space, the optimal substurue is obvious. We devide the skyline into two parts – one part consists of the first (k-1) representive skyline, another part consist of the kth skyline and its neighours.
The kth skyline
the other consists of the last one.
[Intuition]
For every non-representive skyline point, there is a nearby representive skyline point.
[Solution]
DP in 2D case.

posted under Seminar | No Comments »

{Seminar}2009.12.12. – Bursty Traffic over Bursty Links.[4]

December12

Bursty Traffic over Bursty Links
Muhammad H Alizai, Olaf Landsiedel, Jó A Bitsch Link, Stefan Goetz, and Klaus Wehrle (RWTH)
SenSys 2009
Bo YU

It is a system paper working on the routing protocols in WSN, the idea behind it is very simple but the experiments are thorough.
Protocols in WSN aim to establish a routing tree rooted at base station.
There exists some bursty links — links temporarily stable — in WSN.
We may use these links to transmit data with less hops. The paper builds a Short-Term Link Estimator.

The algorithm comprised 4 steps:
1. Reliability:
If one node overhears another node several times, we guess there are temporarily reliable link.
2. Feasibility:
The overhearing node queries the routing protocols and
find the overhearing packet could be transmitted by this node faster(i.e. ),
then the bursty link is feasible
3. Announcement:
It informs the sender-node that it can use the bursty link to transmit data faster
4. Unavaibility
Once failure in receiving/transmitting data, the link is announced unavailable.
We use ETX metric — the number of retransimission required to a successful transmission — to measure the quality of a link.

posted under Seminar | No Comments »