4 R's of DB Research

Reading, Rithmetic, Research and wRiting

Note on how to do research

May14

I feel depressed recently while focusing on my thesis work. I realize that I must pause for a while and reflect on myself. I begin to read some advices of doing research and try to figure out how other people deal with the emotional and technical issues in research.

I read the report of How to do Research at the MIT lab (English)again and I find myself is not alone:-) Here is some notes of my reading.


0.1. Core Tips I summarized

  • Be an active reader.
    • Think what the reviewers think.
    • Think what the authors think.
    • Think what the programmers think.
  • Manange your time.
    • Focous on your central work.
    • Avoid any unnecessary activies not your central work.
    • For me, I should be aware of
      • perfectionism while programming.
      • collecting more paper while feel depressed

  • Communicate with other researchers.
    • You can get feedback and improve it.
    • It is helpful to relax the uncertainty and insecurity while doing research.


0.2. How to read?

“The amount of stuff you need to have read to have a solid grounding in the field may seem intimatingm, but you can read a core part in your research topic and make them as a start point.”

The secret of reading is that you should be an active. Reading while you focus on some problem is especially efficient. Be careful about the trap that “if only I know more X, the problem would be easy”. There’s always more to know that should be relevant.

Keep in mind that

  • How can I use this?
  • What is the motivation?
  • What are the choices the authors made(some are implicit)?
  • Are the formalization realistic?
  • What is the direction the paper suggest?
  • What is the problem lying offer the horizon
  • the paterns of difficulty that keep coming up in the author’s research program
  • the political points the paper may be aimed at
  • Programming or even just thinking about how to implent it is a good tool to encoruge active reading.

Don’t be too dependent on your advisor. Keep on talking and communicating with other people.


0.3. Choose a topic

A good thesis topic will simultaneously express a personal vision and participate in a conversation with the literature.

  • Your personal vision is your reason for being a scientist, an image or principle or idea or goal you care deeply about. A vision is always something big. You may not achieve your vision, but it can point the way.

  • At the same time, science is a conversation. You get feedback from other researchers and improve your work continuously.

    The hardest part is figuring out how to cut your problem down to a solvable size while keeping it big enough to be interesting. You will find you have to narrow your topic continuously.

    After you select your topic, you must answer the following problem. You should have one-sentence, one-paragraph, and five-minute answers. If you don’t know where you are going, people won’t take you seriously.

  • what’s the thesis of your thesis?
  • What are you trying to show?
  • When doing the work, be able to explain simply how each part of your theory and implementation is in service of the goal.
  • Try a simplified version of the thesis problem first. Work examples. Thoroughly explore some concrete instances before making an abstract theory.

Any work that is not central to your thesis should be minimized. Some activities to avoid (unless they are central to the thesis).

  • language design
  • user-interface or graphics hacking
  • inventing new formalisms
  • overoptimizing code
  • tool building
  • bureaucracy


0.4. Emotional Factors

  • Faith: Even the experience of failure will be useful in the future if you take it serious.
  • Time: Research is time-consuming. It usally takes three times as long as you expect to finish a subtask, even after you taking this rule into account.(It is also well-konown as the Hofstadter’s rule).
  • Life: When you decide to pursue a Ph.D., you should realize that reasearch will be part of your life. Many breakthroughts occur while you are in the shower or riding the subway or windowshopping in the Harvard Square.
  • Progress: Tell your mentors or friend your plan in a week and make progress.
  • Depression: You can get into a positive feedback loop (in which doubts about your ability to do the work) eat away at your enthusiams so that in fact you can’t get anything done. Realize that research ability is a learned skill, not inntate genius.
  • Insecurity and Uncertainty: There are no generally accepted standards of progress or of how to evaluate work. Yuo don’t know whether you make any progress. The feeling of uncertianty and self-doubting is inevitable. Communicate with more people is helpful. You may think your work is trival and obvious but in fact it is not. Talk with other people.


0.5. My Vision

My research interest is on various issues raised in novel data management and data mining especially in large graph data. Large graph data includes web hyperlink graph, online social network, and so on.

Two key properties of web and social network data are 1) large scale (distributed computing model )and 2) highly evolving (stream model).
Thus, developing scalable and online algorithm for web and social network search and mining is one primitive goal of my research.

To achieve my goal, I will build my framework of knowledge step by step. IMHO, I should have a solid backgrounding in

Besides of the techinical skills, other skills should pay attention to is

  • the skill to present your idea clearly and concisely
  • the skill to communicate with other people and keep contact

Powered by ScribeFire.

{Seminar}2010.05.13.

May13

There are 3 paper in our seminar this moring. I’ll outline the main problem of each paper.

  • Modeling and querying possible repairs in Duplicate Detection

    George Beskales, Mohamed A. Soliman, Ihab F. Ilyas, Shai Ben-David

    VLDB 2009 Slide

    It shows a novel techniques to show the result of duplicate record with uncertain model. Possoible World of records is generated by hierarchy cluster algorithm with threshoudl limit. The paper also show the query processing in the Uncerain clean relationhip.

  • Reverse Top-k Queries

    Akrivi Vlachou, Christos Doulkeridis, Yannis Kotidis and Kjetil Nørvåg

    ICDE 2010

    Problem: Given top-k point, how to find the class of score function which result to the top-k point? The score function isthe linear combination of all attributes and the weight must be positive.

  • Minimizing the Cost of Mine Selection Via Sensor Networks

    C. Liu and G. Cao

    INFOCOM 2009

    Problem: Given $n$ mines, $m$ targets, how to select a mininum-cost subset of mines to destory each target $j$ with a probablity larger than $\sigma_j$?

    Solution: Greedy algorithms and layering algorithm

Powered by ScribeFire.

posted under Seminar | No Comments »

{Seminar}2010.01.23. – The Last Seminar in the Semester

January23

It is the last seminar in this semester. We present four papers: 2 system paper and 2 algorithm paper.

The note is in messy and I just leave it here to make archive.

Efficient Outer Join Data Skew Handling in Parrell Database

VLDB 2009

Handling Data Skew in Parallel Joins in Shared-Nothing Systems

SIGMOD 2008

Yu Xu @ Teradata et al.

Xixian HAN

GAMPS: Compressing Multi Sensor Data by Grouping and Amplitude Scaling

Sorabh Gandhi (University of California, Santa Barbara), Suman Nath (Microsoft Research), Subhash Suri (University of California, Santa Barbara), Jie Liu (Microsoft Research)

SIGMOD 2009

Yu LIU

Energy Efficient TDMA Sleep Scheduling in Wireless Sensor Networks

Junchao Ma (The Hong Kong Polytechnic University, HK), Wei Lou (The Hong Kong Polytechnic University, HK), Xiang-Yang Li (Illinois Institute of Technology, US)

INFOCOMM 2009

Xiaolin FANG

Correlation Maps: A Compressed Access Method for Exploiting Soft Functional Dependencies

Hideaki Kimura (Brown Univ.), George Huo (Google, Inc), Alexander Rasin (Brown Univ.), Samuel Madden (Massachusetts Institute of Technology), Stan Zdonik (Brown Univ.)

VLDB 2009 Slide

Xianming Liu

Handling Data Skew in Parallel Joins in Shared-Nothing Systems SIGMOD 2008,Yu Xu @ Teradata et al. This is a paper in the Industrial Session. It deals with the join operation in parallel database system

Table:
Two relation R(x,a) and S(y,b) distribute in several Parrellel Units.
Horizontally Partition the Table in R.x and R.y.

Operation:
Join[R,S](R.a,S.b)

Two solutions:

  1. Redistribution Plan
    • Stage 1. Redistribute the relation based on the join attribute with hash function. When the attribute is luckily the partition attribute, the stage is not needed.
    • Stage 2. Join in each PU. Advantage: Nearly linear speed-up on even balancing conditions. Disadvantage: In the case of data skew, where many row hash into the same machine and the machine becomes the hotspot, add more nodes wouldn’t help.
  2. Duplication Plan
    • Stage 1. Broadcast the smallest relation to other PUs.
    • Stage 2. Join in each PU. Advantage: Works when one join relation is very small Disadvantage: The work situation is rare to occur.

The idea is very simple: combine the two strategy. Assume we’ve known that L1 = {skew values of R.a}, L2 = {skew values of S.b}

When partition the R relation

  1. a is in L1, Stay.
  2. a is in L2, Partial Duplication.
  3. otherwise, Partial Redistribution.

Special Case: when a is in both L1 and L2, the system compare two list’s counts of a and only include a in the list with larger counts of a. Thus ensure L1 and L2 is disjoint finally.

The point is that we will not send all data into one machine and make it the bottleneck.

GAMPS: Compressing Multi Sensor Data by Grouping and Amplitude Scaling, SIGMOD 08, UCSB and Microsoft

It is said the first paper try to multi sensor data.

Input:<br />K sensors, each one is a time series S_i={V_i(t),t = [1...n]}<br />template L = {L_1,L_2, ..., L_l}<br /><br />Cost Model: memory requirement SUM (m_i + d_i)<br /><br />Error bound: L-infinite error<br />

ISA: Interval Sharing approximation. Partition the series and use one value to represent a interval.

GAMP: Several interval maybe in different scale, scale it with linear equation.

Energy Efficient TDMA Sleep Scheduling in Wireless Sensor Networks, Infocom 2009

The motivation is that sensor need to consume powers and time when get start up and close. We can design some schedule to make this process as less as possible.

Omit. I’m not interested in it.

Correlation Maps: A Compressed Access Method for Exploiting Soft Functional Dependencies, VLDB 2009, Brown, Google, MIT.

  • Correlations abound. There exist soft dependence between attributed, i.e. we want to find the city “Boston” then the state is “MA” with large possibility.
  • Secondary index maybe useless because they are not clustered.

Core idea:

When making range queries on attribute which is not clustered, we can utilize correlation clustered attribute to optimize the query. This paper designs a correlation map structure to capture it.

Cost Model:

The core contribution of the paper is the cost model to make decision when to make the optimization

Challenge:

Challenge is how to update the structure.

posted under Seminar | No Comments »

Note On Claremont Report

December25

The DB community is extremely HUGE and still growing. IMHO, it seems too ambitious and thus it is difficult to keep track of the field. However, maybe just because of this it is more interesting and exciting.

In the Claremont Report on Database Research, it mentioned many interesting research area

  • management of uncertain information
  • data privacy and security
  • human centric interactions with data
  • social network and Web 2.0
  • personalization and contextualization of query- and search-related tasks
  • Streaming and networked data

IMHO, it a revolution in DB research area by Web 2.0 and Scientific Computing

  1. Revisiting DB Engines: System Design Challenge
    • Workload is different such as web 2.0
    • Cluster is rising because of Map-Reduce
    • hardware is evolving
  2. Declarative Programming for Emerging Platforms
    • New language model? Ruby On Rails’s success of ORM
  3. The Interplay of Structured and Unstructured Data
    • Information Extraction: connection with IR & ML. Sematics of the Data.
    • Information Integration: Query Large Distributed Heterogeneous Data.
    • Created Data by User: It is an extremely attracting problem in the real world application.
      • Usability
      • Feedback Loop
      • Conflict Data
  4. Cloud Data Services
    • IMHO, it is just an alternative name of parallel and distributed computing.
    • What is the difference? The difference is that ordinary people now can easily utilize large cluster resources easily via Map/Reduce paradigm.
  5. Mobile Applications and Virtual Worlds
    • Many Applications: But what is the problem?
    • Privacy in Data Sharing: a HOT research area. There are two sessions in VLDB 2009 and SIGMOD 2009.

The Note seems in a mess. The point is that web 2.0’s challenge.

posted under Vision | No Comments »

[SDM 08] A Spamicity Approach to Web Spam Detection

December16

1. Backgroud

Two types of spam: link spam and term spam.

  • Term Spam: Add many keywords into one page and usually make them invisiable but searchable to cheat TF/IF
  • Link Spam: Construct links to boost pagerank score.

The paper deals with both conditions, but I’m only interested in the latter one.

i.e. Given a page p, determine whether p is link spam.

Related Paper:

  • B. Zhou. Mining page farms and its application in link spam detection. Master’s thesis. School of Computing Science, Simon Fraser University, 2007.
  • B. Zhou and J. Pei. Sketching landscapes of page farms. In SDM’07.
  • Bin Zhou, Jian Pei and Zhaohui Tang. “A Spamicity Approach to Web Spam Detection”. In Proceedings of the 2008 SIAM International Conference on Data Mining (SDM’08), Atlanta, Georgia, USA, April 24-26, 2008.

2. Intuition

Give a page p, what is the minimal essential sub-structure to compute PR(p)?

It consists of two component:

  • Directed Paths to p. All vertices in the path forced a vertices set H
  • All out-links of H. (Why? To reserve the transition probability of pages in H)

But, the structure is extremely large and even amount to the whole web graph. Thus, we impose some constraints to the sub-structure, capture local environment of page p, and make its size affordable.

Two constrains:

  • path length is not greater than k
  • PR(p,G(H)) is at least θ portion of PR(p,G)

The smallest subgraph under these constraints is called (θ, k)-farm of page p.

Now, it comes another interesting problem, extract the page farm of a given page p.

3. Problem

3.1. Assumption

What is a link spam?

The assumption is that: For a link spam p, Farm(p) try to make PR(p,Farm(p)) as high as possible.

We define the Spamicity(p) as ratio of PR(p,Farm(p)) against PR(p,OptFarm(p)).

**Link spam is a page whose spamicity exceed a threshold **

Given vertices and edges size of Farm(p), we can compute PR(p,OptFarm(p)) easily. Thus, the problem is how to extract Farm(p) and compute PR(p,Farm(p)).

3.2. Input

Given

  • a page p.
  • Web Graph G.
  • Parameters of page farm – (θ,k).
  • A threshold t.

3.3. Output

Spamicity(p) ? t

3.4. Solution

We can extract the (θ, k)-Farm(p) and then compute the utility. Unfortunately, the first problem is NP-hard. (Knapsack problem can reduce to it.) We choose approximation algorithms to extract the page farm of p.

3.4.1. Solution 1

Local Greedy Search Method

  1. Extract the (θ, k)-Farm(p):
    • Add pages with the highest page contribution to p into the farm until achieve a θ-portion of PR(p,G)
    • page contribution can be effective computed by page contribution
  2. Compute the Utility

3.4.2. Solution 2

Monotone Greedy Search Method Observation: If we construct farm(p) using the local greedy method(i.e add pages in the descending page contribution), the spamicity decreases monotonically. We adjust local greedy search method. Stop if

  • spamicity is lower than the threshold
  • OR all k-distance vertices have been added

4. Comments

  • Good: Local greedy search method, no need to know the whole web graph structure.
  • No theoretical guarantee on the approximation algorithm and it seems hard to proof it.
  • Experiment is on a spam test collection – 8,415 pages (normal: 7472, spam: 767, borderline: 176[Discarded]). Maybe more efficient on large web graph.

{Seminar}2009.12.12. – Bursty Traffic over Bursty Links.[4]

December12

Bursty Traffic over Bursty Links
Muhammad H Alizai, Olaf Landsiedel, Jó A Bitsch Link, Stefan Goetz, and Klaus Wehrle (RWTH)
SenSys 2009
Bo YU

It is a system paper working on the routing protocols in WSN, the idea behind it is very simple but the experiments are thorough.
Protocols in WSN aim to establish a routing tree rooted at base station.
There exists some bursty links — links temporarily stable — in WSN.
We may use these links to transmit data with less hops. The paper builds a Short-Term Link Estimator.

The algorithm comprised 4 steps:
1. Reliability:
If one node overhears another node several times, we guess there are temporarily reliable link.
2. Feasibility:
The overhearing node queries the routing protocols and
find the overhearing packet could be transmitted by this node faster(i.e. ),
then the bursty link is feasible
3. Announcement:
It informs the sender-node that it can use the bursty link to transmit data faster
4. Unavaibility
Once failure in receiving/transmitting data, the link is announced unavailable.
We use ETX metric — the number of retransimission required to a successful transmission — to measure the quality of a link.

posted under Seminar | No Comments »

{Seminar}2009.12.12. – Distance-based Representative Skyline.[3]

December12

Distance-based Representative Skyline.
Yufei Tao (CUHK), Ling Ding (CUHK), Xuemin Lin (UNSW), Jian Pei (SFU)
ICDE 2009
Guohua LI

This paper maybe the hardest one in today’s seminar.
[Background]
Review of Skyline
Given a set D of multidimensional points skyline consists of the points that are not dominated by any other points.
X Dominates Y means X is smaller than Y in all dimensions.
In another word, skyline point implys that there at least one dimension of this point is smaller than others while other dimensions not.

What’s the k-representive skyline?
In this paper, it is defined by Representation Error — the maximum value of non-representive skyline point to its nearest skyline point.
Given Skyline S={s_1,s_2,…,s_n}
In 2-D space, the optimal substurue is obvious. We devide the skyline into two parts – one part consists of the first (k-1) representive skyline, another part consist of the kth skyline and its neighours.
The kth skyline
the other consists of the last one.
[Intuition]
For every non-representive skyline point, there is a nearby representive skyline point.
[Solution]
DP in 2D case.

posted under Seminar | No Comments »

{Seminar}2009.12.12. – Indexing Boolean Expressions.[2]

December12

Indexing Boolean Expressions
Steven Whang (Stanford Univ.), Chad Brower (Yahoo! Research), Jayavel Shanmugasundaram (Yahoo! Research), Serguei Vassilvitskii (Yahoo! Research), Erik Vee (Yahoo! Research), Ramana Yerneni (Yahoo! Research), Hector Garcia-Molina (Stanford Univ.)
VLDB 2009 [slides]
Xianming LIU

The problem is simple but practical. I like this type of problems.

[Problem]
Input: Given a assignment, a large set of Boolean Expressions(BE)
Output: All satisified BEs  under the assignment.

[Application]
Online Advertisig:
A BE represents an advertiser’s user targeting requirements.
An assignment of value to attributes represents the characteristics of a user visiting an online page.
Publish/Subscribe System:
A BE represents a subscription.
An assignment of value to attributes represents an event.

[Solution]
Build a inverted index [Atomic Expression, BE position list] to index all expressions.
- Conjunction: Merge sorted list
- DNF: find the
- CNF:
What is the key difference between the IR joining algorithms

posted under Seminar | No Comments »

{Seminar}2009.12.12. – Light-weight, Runtime Verification of Query Sources.[1]

December12

Light-weight, Runtime Verification of Query Sources.
Tingjian Ge (Brown U), Stan Zdonik (Brown U).
ICDE 2009
Lei YU

Background
Data owner outsources their data to the Network Storage System such as AWS.
The storage provider should ensure the underlying data not been modified by attackers
This paper address arbitrary modification of data incluing changes that may pass the DB.
In a world, it deals with modification in file system.

Q: What about Secure File system? What is the difference?
A: They are general purpose, not full using DB semantics? (No time to review, see paper)

“The basic idea is that we build one or more Merkle hash tree(s) for each sensitive DB object (e.g. specifica attributes and their associate index).
Then at query execution time, as we read data from the disk,
we verify that a minimum subset of “referenced” bytes from the sensitive DB objects is authentic,
thus providing that the data source and query result and both correct and complete
relateive to the initial “good state” from which the Merkle trees were built.”

The Merkle Hash Tree is a tree structure originally proposed for efficient authentication of equality queries in a DB sorted on the query attribute.
Every DB record corresponds to a distinct leaf node.
Each leaf node stores a hash value for the binary representation of its record.
The tree is constructed bottom-up, with each internal node storing the hash value of the concatenation of the hash value of its children.
With the root of the hash tree *and* the encryption key, we compute the signature of sensitive data.
*Query*
As a query runs, we need to verify it with Hash Tree by computing the signature again.
The server verifies the authenticity of the data it used from disk by computing hash values in a bottom-up manner, all the way up to the root.
Then we compared the signature with original signature to verified the authenticity.

*Update*
If we insert a data, we need to update the corresponding hash value in the tree.
The problem is if only part of the data modified, is it failure for all records?

*Organization*
Step 1. We split each page of a sensitive DB objects into several sub-blocks and construct a small in-page hash tree.
Step 2. All these in-pages hash trees make up a big hash tree.

SHA-1 is faster when the input is bigger message blocks

Q: Why we have to split a page?
A: Choose a fanout such that each hash operation is applied to a number of bytes that is close to the sub-block size

Q: How to split a page (i.e. what’s sub-block size of a page)?
A: In the experiment, try 3 size (1024, 512 and 256 bytes)
Tradeoff between time and space.
Small Size: quicker but bigger
Big Size: slower but smaller

Q: OK, then how to organize the in-page hash page ? (i.e. #fanout=?)
A: each hash operation applied to a number of bytes that is close to the sub-block size.

*Concurrency Control*
Two basic problems in CC, sync and reader/writer, merge some update operation to improver performance.
4 Phases in Writer
Synchronization Primitives and Reader/Writer
Write Transaction
Verification
Mark
Update
Helping

Q: Does the tree need to be balanced?

娱乐:
八卦一下 Merkle, 似乎很多数据结构都是以这个作为开头的
钻石时代

posted under Seminar | No Comments »

{Seminar}2009.12.12. – Note on DKERC Semniar [0]

December12

We discuss four papers in this morning. See DKERC website for detail.
This is brief note on today DKERC Seminar. Because of time limit, the note maybe seem superficial , I will review them later.

I find it is hard to paraphrase some definition in original paper, thus I will just quote them sometimes :-)
The following is some nots about these papers,
{Seminar}2009.12.12. – Light-weight, Runtime Verification of Query Sources.[1]
{Seminar}2009.12.12. – Indexing Boolean Expressions.[2]
{Seminar}2009.12.12. – Distance-based Representative Skyline.[3]
{Seminar}2009.12.12. – Bursty Traffic over Bursty Links.[4]

posted under Seminar | No Comments »
« Older Entries