4 R's of DB Research

Reading, Rithmetic, Research and wRiting

Skim on VLDB 2010

October9

Industry Sessions

感兴趣的industry paper

I09 p.1469: Distance-Based Outlier Detection: Consolidation and Renewed Bearing

异常点检测的实验

I12 p.1505: Confucius and Its Intelligent Disciples: Integrating Social with Search
Q&A Sites@Google

Research Sessions

Session 1: Database Security

R1 p.13: Building Disclosure Risk Aware Query Optimizers for Relational Databases

R2 p.25: Secure Personal Data Servers: a Vision Paper
大量的隐私数据被网络所收集: 建立个人Data Server, 用户决定自己的数据如何被使用(by whom, for how long, according to which rule, for which purpose)。很有意思的概念

R3 p.36: PolicyReplay: Misconfiguration-Response Queries for Data Breach Reporting

Session 2: Parallel and Distributed Databases

R4 p.48: Schism: a Workload-Driven Approach to Database Replication and Partitioning
R5 p.58: Ten Thousand SQLs: Parallel Keyword Queries Computing
R6 p.70: The Case for Determinism in Database Systems

Session 3: Data Exchange

数据集成/交换中的经典问题: Schema Mapping

R7 p.81: MapMerge: Correlating Independent Schema Mappings
R8 p.93: Chase Termination: A Constraints Rewriting Approach
R9 p.105: Scalable Data Exchange with Functional Dependencies

Session 4: Database Services and Applications

R10 p.117: Interactive Route Search in the Presence of Order Constraints
常规的地理search, 返回相关的实体集合,这里定义了一种新的查询方式

输入: 几个search queries 输出: A Route that goes via entities

这个过程是Interactive的,分步进行,用户每步给出相应的entities是否正确。

R11 p.129: Energy Management for MapReduce Clusters
Data Center的能源管理策略:1)关掉利用率低的机器,2)用cluster计算完成任务后,关掉整个cluster。结果是2)比1)好

R12 p.140: Toward Scalable Keyword Search over Relational Data
为了性能的提升, 返回部分结果,牺牲广度和精度。

Session 5: Data Models and Languages

R13 p.150: From Regular Expressions to Nested Words: Unifying Languages and Query Execution for Relational and XML Sequences
pattern matching over event stream 的语言

R14 p.162: Avalanche-Safe LINQ Compilation
LINQ

R15 p.173: Towards Certain Fixes with Editing Rules and Master Data
完整性约束只能找出错误,利用master data纠正错误

Session 6: Semantics 语义

R16 p.185: Explaining Missing Answers to SPJUA Queries
允许用户做出这样的查询: why certain tuples are not in the query results

R17 p.197: Sampling the Repairs of Functional Dependency Violations under Hard Constraints
修复违反FD(函数依赖)的数据。

传统方法1) 产生基于某种metric最优的repairs(优化问题);2)不直接产生repair,而是产生符合要求的answer

本文方法: 精心生成候选集,从中采样。

R18 p.208: Evaluating Entity Resolution Results
使用GMD measure作为ER结果的评估方法,类似于edit distance, 以split 和merge作为基本操作。

Session 7: Stream Databases

R19 p.220: High-Performance Dynamic Pattern Matching over Disordered Streams

Stream pattern matching

R20 p.232: SECRET: A Model for Analysis of the Execution Semantics of Stream Processing Systems
分析Stream Processing Engines的引擎

R21 p.244: Recognizing Patterns in Streams with Imprecise Timestamps
不确定的流数据,以前假设流中的事件发生时刻已知且精确,这里假设未知或不精确。

Session 8: RDF and Graphs

R22 p.256: x-RDF-3X: Fast Querying, High Update Rates, and Consistency for RDF Databases

R23 p.264: Graph Pattern Matching: From Intractable to Polynomial Time
定义了一种新式的graph pattern: edge代表数据图中的连通性

R24 p.276: GRAIL: Scalable Reachability Index for Large Graphs
大图的可达索引,关注scalability

Session 9: Middleware Platforms for Data Management 中间件

R25 p.285: HaLoop: Efficient Iterative Data Processing on Large Clusters

R26 p.297: The Impact of Virtual Views on Containment

R27 p.309: Updatable and Evolvable Transforms for Virtual Databases

Session 10: Novel/Advanced Applications

R28 p.320: Navigating in Complex Mashed-Up Applications
Mashup and navigation

R29 p.330: Dremel: Interactive Analysis of Web-Scale Datasets
Interactive ad-hoc query system @ Google, 使用列存储nested data

R30 p.340: On Graph Query Optimization in Large Networks
Spath查询方法

Session 11: Ranking Queries

R31 p.352: Proximity Rank Join
已知: Relations, tuple = (score, 实值特征向量),给定一个目标vector

返回与目标最“接近”的k个tuples的组合。

R32 p.364: Identifying the Most Influential Data Objects with Reverse Top-k Queries
Reverse Top-k: 识别出最喜欢(受影响)的用户的集合,两个算法(SB和BB),这名字起得真是囧。

R33 p.373: Retrieving Top-k Prestige-Based Relevant Spatial Web Objects
对于查询,如果一个相关结果其附近的对象也是相关,那么这个结果就有较高的威望。基于此,进行location aware keyword search.

Session 12: Spatial and Temporal Databases

R34 p.385: Parsimonious Linear Fingerprinting for Time Series
利用数值序列的joint dynamic得到fingerprinting来进行 mining and summarizing, 优点是a)易于理解 b)应用广泛,压缩、聚类等c) 线性时间。

R35 p.397: The HV-tree: a Memory Hierarchy Aware Version Index
Version Index: 将数据渐进地迁移到更大(也更慢)的存储系统中. TSB Tree考虑了两级(MemoryàDisk), 本文考虑Cache与Memory

R36 p.409: Transforming Range Queries To Equivalent Box Queries To Optimize Page Access
将Range Queries转化为Box Queries优化页存取

Session 13: Record Linkage

R37 p.417: Record Linkage with Uniqueness Constraints and Erroneous Values
数据冲突问题,利用某些实体的属性满足唯一性约束(如身份证等),将问题转换为k-划分图问题。

R38 p.429: On-the-Fly Entity-Aware Query Processing in the Presence of Linkage
用Uncertainty解决实体识别,并且是online的

R39 p.439: Behavior Based Record Linkage
利用transaction log计算行为,以及合并2个表面上看起来不同但是行为相同的结果对应的代价来决定是否合并

Session 14: Experimental Analysis and Performance

R40 p.449: iGraph: A Framework for Comparisons of Disk-Based Graph Indexing Techniques
目前的Graph Query采用的方法是利用index 过滤,然后进行子图同构验证。

R41 p.460: Runtime Measurements in the Cloud: Observing, Analyzing, and Reducing Variance
EC2性能测试

R42 p.472: The Performance of MapReduce: An In-depth Study
Handoop性能测试

R43 p.484: Evaluation of entity resolution approaches on real-world match problems
实体识别实验,使用真实世界的match task

Session 15: Cloud Computing

R44 p.494: MRShare: Sharing Across Multiple Queries in MapReduce
共享相似的查询结果

R45 p.506: Towards Elastic Transactional Cloud Storage with Range Query Support
云存储系统

R46 p.518: Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing)
让一头黄色的大象跑得像猎豹一样

Session 16: Query Processing I

R47 p.530: Slicing Long-Running Queries
切分运行时间长的Query

R48 p.542: Sharing-Aware Horizontal Partitioning for Exploiting Correlations During Query Processing

R49 p.554: Advanced Processing for Ontological Queries

Session 17: Data Extraction

R50 p.566: Towards The Web of Concepts: Extracting Concepts from Large Datasets
Concept代表实体的词序列(句子?)

R51 p.578: Exploiting Content Redundancy for Web Information Extraction
抽取template-based 的网页

R52 p.588: Automatic Rule Refinement for Information Extraction
基于规则的自动信息提取

Session 18: Privacy

R53 p.598: Embellishing Text Search Queries To Protect User Privacy
文本搜索时关键词导致的隐私泄露

R54 p.608: Small Domain Randomization: Same Privacy, More Utility
传统的Privacy问题

R55 p.619: Nearest Neighbor Search with Strong Location Privacy
NN search位置隐私

Session 19: Probabilistic and Uncertain Databases

R56 p.630: UPI: A Primary Index for Uncertain Databases
不确定数据库的主索引

R57 p.638: Ranking Continuous Probabilistic Datasets
连续概率分布上的rank查询

R58 p.650: Set Similarity Join on Probabilistic Data
概率数据上的set similarity join

Session 20: Databases on Modern Hardware

R59 p.660: Complex Event Detection at Wire Speed with FPGAs

R60 p.670: Database Compression on Graphics Processors

R61 p.681: Aether: A Scalable Approach to Logging

Session 21: Data Mining

R62 p.693: Scalable Discovery of Best Clusters on Large Graphs
Top-Graph Clusters, 只返回最可能的那几个cluster

R63 p.703: An Architecture for Parallel Topic Models

R64 p.711: Keyword++: A Framework to Improve Keyword Search Over Entity Databases
通过实体数据库改进搜索质量

Session 22: Moving Object Databases

R65 p.723: Swarm: Mining Relaxed Temporal Moving Object Clusters
对象可能暂时分散,但是在某个时间点聚集,以前的定义要求对象在一个连续的时间距离,这里定义了一种新的结构,称为swarm

R66 p.735: An Adaptive Updating Protocol for Reducing Moving Object Databases Workload
(location 和speed)的更新问题,允许延迟的更新,并根据结果调整。

R67 p.747: Shortest Path Computation on Air Indexes
Mobile Road Network中的最短路径查询

Session 23: Probabilistic Data

R68 p.758: Efficient and Effective Similarity Search over Probabilistic Data based on Earth Mover’s Distance
基于Earth Mover’s Distance的similarity search, 利用对偶原理以及B树剪枝。

R69 p.770: Probabilistic XML via Markov Chains

Session 24: Fuzzy, Probabilistic and Approximate Databases

R70 p.782: MCDB-R: Risk Analysis in the Database
不确定性被建模在不确定值的概率分布,导致了其查询结果也是个概率分布。企业通常关心分布的lower or upper tail(风险评估)。这里通过改进的MCDB能获取tail的sample.

R71 p.794: Scalable Probabilistic Databases with Factor Graphs and MCMC
一种新的概率数据库建模方法,利用factor graph编码分布,利用MCMC recover, 低层仍然采用single world模型

Session 25: Discovery and Exploration

R72 p.805: On Multi-Column Foreign Key Discovery
对于没有给出外键约束的DB自动发现约束

R73 p.815: Explore or Exploit? Effective Strategies for Disambiguating Large Databases
利用上下文clean data

Session 26: Information Filtering and Dissemination

R74 p.826: Building Ranked Mashups of Unstructured Sources with Uncertain Information

R75 p.838: Computing Closed Skycubes

Session 27: Query Processing II

R76 p.848: Generating Databases for Query Workloads

R77 p.860: Processing Top-k Join Queries

R78 p.871: Two-way Replacement Selection
这个貌似很NB。Merge Sort生成的runs的replacement策略的改进

Session 28: XML Data

R79 p.882: XPath Whole Query Optimization

R80 p.894: Fast Optimal Twig Joins
R81 p.906: Destabilizers and Independence of XML Updates

Session 29: Workflows, Transactions and Business Processes

R82 p.918: Searching Workflows with Hierarchical Views

R83 p.928: Data-Oriented Transaction Execution

R84 p.940: Optimal Top-K Query Evaluation for Weighted Business Processes

Session 30: Scientific databases

R85 p.952: Behavioral Simulations in MapReduce

R86 p.964: A*-tree: A Structure for Storage and Modeling of Uncertain Multidimensional Arrays

R87 p.975: On Dense Pattern Mining in Graph Streams
频繁的dense pattern挖掘

Session 31: Mobility and Spatial Queries

R88 p.985: Efficient Proximity Detection among Mobile Users via Self-Tuning Policies
给定用户之间的threshold, 找出和其proximity在threshold在某threshold一下的用户

R89 p.997: k-Nearest Neighbors in Uncertain Graphs
不确定图的kNN

R90 p.1009: Mining Significant Semantic Locations From GPS Data
利用GPS位置信息获取语义信息,建立location, user的Graph,利用RW设定significance

Session 32: Data Anonymization Techniques

R91 p.1021: Boosting the Accuracy of Differentially Private Histograms Through Consistency

R92 p.1033: rho-uncertainty: Inference-Proof Transaction Anonymization

R93 p.1045: Minimizing Minimality and Maximizing Utility: Analyzing Method-based attacks on Anonymized Data

Session 33: Querying and Integrating Probabilistic Databases

R94 p.1057: Querying Probabilistic Information Extraction
综合IE+Query步骤

R95 p.1068: Read-Once Functions and Query Evaluation in Probabilistic Databases
PDB的Query Evaluation

R96 p.1080: Foundations of Uncertain-Data Integration

Session 34: Database Design

R97 p.1091: Identifying, Attributing and Describing Spatial Bursts
利用用户生成数据识别地理集中的事件

R98 p.1103: CORADD: Correlation Aware Database Designer for Materialized Views and Indexes

R99 p.1114: Regret-Minimizing Representative Databases
定义了一种新的查询k-regret查询

Session 35: Query Optimization

R100 p.1125: An Access Cost-Aware Approach for Object Retrieval over Multiple Sources
多数据源选择

R101 p.1137: On the Stability of Plan Costs and the Costs of Plan Stability
参数化的计划生成与选择

R102 p.1149: Xplus: A SQL-Tuning-Aware Query Optimizer

Session 36: Graph and Pattern Matching

R103 p.1161: Graph Homomorphism Revisited for Graph Matching
p-同构

R104 p.1173: SigMatch: Fast and Scalable Multi-Pattern Matching
字符多目标匹配

R105 p.1185: SAPPER: Subgraph Indexing and Approximate Matching in Large Graphs
找出匹配图出现的次数(with possible missing edges)

Session 37: Indexing Techniques

R106 p.1195: Tree Indexing on Solid State Drives

R107 p.1207: Efficient B-tree Based Indexing for Cloud Data Processing

R108 p.1219: Trie-Join: Efficient Trie-based String Similarity Joins with Edit-Distance Constraints

Session 38: Query Processing III

R109 p.1231: VoR-Tree: R-trees with Voronoi Diagrams for Efficient Processing of Spatial Nearest Neighbor Queries
R-tree with Voroni Diagrams

R110 p.1243: Efficient RkNN Retrieval with Arbitrary Non-Metric Similarity Measures
RkNN

R111 p.1255: Efficient Skyline Evaluation over Partially Ordered Domains

Session 39: Streaming and Sensor Data

R112 p.1267: Achieving High Output Quality under Limited Resources through Structure-based Spilling in XML Streams

R113 p.1279: Dynamic Join Optimization in Multi-Hop Wireless Sensor Networks

R114 p.1291: Database-support for Continuous Prediction Queries over Streaming Data

R115 p.1302: Conditioning and Aggregating Uncertain Data Streams: Going Beyond Expectations

Session 40: Information Integration and Retrieval

R116 p.1314: TRAMP: Understanding the Behavior of Schema Mappings through Provenance

R117 p.1326: Entity Resolution with Evolving Rules

R118 p.1338: Annotating and Searching Web Tables Using Entities, Types and Relationships

Session 41: Data Mining, Copy Detection and Data Publication

R119 p.1348: Interesting-Phrase Mining for Ad-Hoc Text Analytics

R120 p.1358: Global Detection of Complex Copying Relationships Between Sources

R121 p.1370: Fragments and Loose Associations: Respecting Privacy in Data Publishing

posted under DB | No Comments »

Note on how to do research

May14

I feel depressed recently while focusing on my thesis work. I realize that I must pause for a while and reflect on myself. I begin to read some advices of doing research and try to figure out how other people deal with the emotional and technical issues in research.

I read the report of How to do Research at the MIT lab (English)again and I find myself is not alone:-) Here is some notes of my reading.

0.1. Core Tips I summarized

  • Be an active reader.
    • Think what the reviewers think.
    • Think what the authors think.
    • Think what the programmers think.
  • Manange your time.
    • Focous on your central work.
    • Avoid any unnecessary activies not your central work.
    • For me, I should be aware of
      • perfectionism while programming.
      • collecting more paper while feel depressed
  • Communicate with other researchers.
    • You can get feedback and improve it.
    • It is helpful to relax the uncertainty and insecurity while doing research.

0.2. How to read?

“The amount of stuff you need to have read to have a solid grounding in the field may seem intimidating, but you can read a core part in your research topic and make them as a start point.”

The secret of reading is that you should be an active. Reading while you focus on some problem is especially efficient. Be careful about the trap that “if only I know more X, the problem would be easy”. There’s always more to know that should be relevant.

Keep in mind that

  • How can I use this?
  • What is the motivation?
  • What are the choices the authors made(some are implicit)?
  • Are the formalization realistic?
  • What is the direction the paper suggest?
  • What is the problem lying offer the horizon
  • the paterns of difficulty that keep coming up in the author’s research program
  • the political points the paper may be aimed at
  • Programming or even just thinking about how to implent it is a good tool to encoruge active reading.

Don’t be too dependent on your advisor. Keep on talking and communicating with other people.

0.3. Choose a topic

A good thesis topic will simultaneously express a personal vision and participate in a conversation with the literature.

  • Your personal vision is your reason for being a scientist, an image or principle or idea or goal you care deeply about. A vision is always something big. You may not achieve your vision, but it can point the way.
  • At the same time, science is a conversation. You get feedback from other researchers and improve your work continuously.

    The hardest part is figuring out how to cut your problem down to a solvable size while keeping it big enough to be interesting. You will find you have to narrow your topic continuously.

    After you select your topic, you must answer the following problem. You should have one-sentence, one-paragraph, and five-minute answers. If you don’t know where you are going, people won’t take you seriously.

  • what’s the thesis of your thesis?
  • What are you trying to show?
  • When doing the work, be able to explain simply how each part of your theory and implementation is in service of the goal.
  • Try a simplified version of the thesis problem first. Work examples. Thoroughly explore some concrete instances before making an abstract theory.

Any work that is not central to your thesis should be minimized. Some activities to avoid (unless they are central to the thesis).

  • language design
  • user-interface or graphics hacking
  • inventing new formalisms
  • overoptimizing code
  • tool building
  • bureaucracy

0.4. Emotional Factors

  • Faith: Even the experience of failure will be useful in the future if you take it serious.
  • Time: Research is time-consuming. It usally takes three times as long as you expect to finish a subtask, even after you taking this rule into account.(It is also well-konown as the Hofstadter’s rule).
  • Life: When you decide to pursue a Ph.D., you should realize that reasearch will be part of your life. Many breakthroughts occur while you are in the shower or riding the subway or windowshopping in the Harvard Square.
  • Progress: Tell your mentors or friend your plan in a week and make progress.
  • Depression: You can get into a positive feedback loop (in which doubts about your ability to do the work) eat away at your enthusiams so that in fact you can’t get anything done. Realize that research ability is a learned skill, not inntate genius.
  • Insecurity and Uncertainty: There are no generally accepted standards of progress or of how to evaluate work. Yuo don’t know whether you make any progress. The feeling of uncertianty and self-doubting is inevitable. Communicate with more people is helpful. You may think your work is trival and obvious but in fact it is not. Talk with other people.

0.5. My Vision

My research interest is on various issues raised in novel data management and data mining especially in large graph data. Large graph data includes web hyperlink graph, online social network, and so on.

Two key properties of web and social network data are 1) large scale (distributed computing model )and 2) highly evolving (stream model).
Thus, developing scalable and online algorithm for web and social network search and mining is one primitive goal of my research.

To achieve my goal, I will build my framework of knowledge step by step. IMHO, I should have a solid backgrounding in

Besides of the techinical skills, other skills should pay attention to is

  • the skill to present your idea clearly and concisely
  • the skill to communicate with other people and keep contact

Powered by ScribeFire.

{Seminar}2010.05.13.

May13

There are 3 paper in our seminar this moring. I’ll outline the main problem of each paper.

  • Modeling and querying possible repairs in Duplicate Detection

    George Beskales, Mohamed A. Soliman, Ihab F. Ilyas, Shai Ben-David

    VLDB 2009 Slide

    It shows a novel techniques to show the result of duplicate record with uncertain model. Possoible World of records is generated by hierarchy cluster algorithm with threshoudl limit. The paper also show the query processing in the Uncerain clean relationhip.

  • Reverse Top-k Queries

    Akrivi Vlachou, Christos Doulkeridis, Yannis Kotidis and Kjetil Nørvåg

    ICDE 2010

    Problem: Given top-k point, how to find the class of score function which result to the top-k point? The score function isthe linear combination of all attributes and the weight must be positive.

  • Minimizing the Cost of Mine Selection Via Sensor Networks

    C. Liu and G. Cao

    INFOCOM 2009

    Problem: Given $n$ mines, $m$ targets, how to select a mininum-cost subset of mines to destory each target $j$ with a probablity larger than $\sigma_j$?

    Solution: Greedy algorithms and layering algorithm

Powered by ScribeFire.

posted under Seminar | No Comments »

{Seminar}2010.01.23. – The Last Seminar in the Semester

January23

It is the last seminar in this semester. We present four papers: 2 system paper and 2 algorithm paper.

The note is in messy and I just leave it here to make archive.

Efficient Outer Join Data Skew Handling in Parrell Database

VLDB 2009

Handling Data Skew in Parallel Joins in Shared-Nothing Systems

SIGMOD 2008

Yu Xu @ Teradata et al.

Xixian HAN

GAMPS: Compressing Multi Sensor Data by Grouping and Amplitude Scaling

Sorabh Gandhi (University of California, Santa Barbara), Suman Nath (Microsoft Research), Subhash Suri (University of California, Santa Barbara), Jie Liu (Microsoft Research)

SIGMOD 2009

Yu LIU

Energy Efficient TDMA Sleep Scheduling in Wireless Sensor Networks

Junchao Ma (The Hong Kong Polytechnic University, HK), Wei Lou (The Hong Kong Polytechnic University, HK), Xiang-Yang Li (Illinois Institute of Technology, US)

INFOCOMM 2009

Xiaolin FANG

Correlation Maps: A Compressed Access Method for Exploiting Soft Functional Dependencies

Hideaki Kimura (Brown Univ.), George Huo (Google, Inc), Alexander Rasin (Brown Univ.), Samuel Madden (Massachusetts Institute of Technology), Stan Zdonik (Brown Univ.)

VLDB 2009 Slide

Xianming Liu

Handling Data Skew in Parallel Joins in Shared-Nothing Systems SIGMOD 2008,Yu Xu @ Teradata et al. This is a paper in the Industrial Session. It deals with the join operation in parallel database system

Table:
Two relation R(x,a) and S(y,b) distribute in several Parrellel Units.
Horizontally Partition the Table in R.x and R.y.

Operation:
Join[R,S](R.a,S.b)

Two solutions:

  1. Redistribution Plan
    • Stage 1. Redistribute the relation based on the join attribute with hash function. When the attribute is luckily the partition attribute, the stage is not needed.
    • Stage 2. Join in each PU. Advantage: Nearly linear speed-up on even balancing conditions. Disadvantage: In the case of data skew, where many row hash into the same machine and the machine becomes the hotspot, add more nodes wouldn’t help.
  2. Duplication Plan
    • Stage 1. Broadcast the smallest relation to other PUs.
    • Stage 2. Join in each PU. Advantage: Works when one join relation is very small Disadvantage: The work situation is rare to occur.

The idea is very simple: combine the two strategy. Assume we’ve known that L1 = {skew values of R.a}, L2 = {skew values of S.b}

When partition the R relation

  1. a is in L1, Stay.
  2. a is in L2, Partial Duplication.
  3. otherwise, Partial Redistribution.

Special Case: when a is in both L1 and L2, the system compare two list’s counts of a and only include a in the list with larger counts of a. Thus ensure L1 and L2 is disjoint finally.

The point is that we will not send all data into one machine and make it the bottleneck.

GAMPS: Compressing Multi Sensor Data by Grouping and Amplitude Scaling, SIGMOD 08, UCSB and Microsoft

It is said the first paper try to multi sensor data.

Input:<br />K sensors, each one is a time series S_i={V_i(t),t = [1...n]}<br />template L = {L_1,L_2, ..., L_l}<br /><br />Cost Model: memory requirement SUM (m_i + d_i)<br /><br />Error bound: L-infinite error<br />

ISA: Interval Sharing approximation. Partition the series and use one value to represent a interval.

GAMP: Several interval maybe in different scale, scale it with linear equation.

Energy Efficient TDMA Sleep Scheduling in Wireless Sensor Networks, Infocom 2009

The motivation is that sensor need to consume powers and time when get start up and close. We can design some schedule to make this process as less as possible.

Omit. I’m not interested in it.

Correlation Maps: A Compressed Access Method for Exploiting Soft Functional Dependencies, VLDB 2009, Brown, Google, MIT.

  • Correlations abound. There exist soft dependence between attributed, i.e. we want to find the city “Boston” then the state is “MA” with large possibility.
  • Secondary index maybe useless because they are not clustered.

Core idea:

When making range queries on attribute which is not clustered, we can utilize correlation clustered attribute to optimize the query. This paper designs a correlation map structure to capture it.

Cost Model:

The core contribution of the paper is the cost model to make decision when to make the optimization

Challenge:

Challenge is how to update the structure.

posted under Seminar | No Comments »

Note On Claremont Report

December25

The DB community is extremely HUGE and still growing. IMHO, it seems too ambitious and thus it is difficult to keep track of the field. However, maybe just because of this it is more interesting and exciting.

In the Claremont Report on Database Research, it mentioned many interesting research area

  • management of uncertain information
  • data privacy and security
  • human centric interactions with data
  • social network and Web 2.0
  • personalization and contextualization of query- and search-related tasks
  • Streaming and networked data

IMHO, it a revolution in DB research area by Web 2.0 and Scientific Computing

  1. Revisiting DB Engines: System Design Challenge
    • Workload is different such as web 2.0
    • Cluster is rising because of Map-Reduce
    • hardware is evolving
  2. Declarative Programming for Emerging Platforms
    • New language model? Ruby On Rails’s success of ORM
  3. The Interplay of Structured and Unstructured Data
    • Information Extraction: connection with IR & ML. Sematics of the Data.
    • Information Integration: Query Large Distributed Heterogeneous Data.
    • Created Data by User: It is an extremely attracting problem in the real world application.
      • Usability
      • Feedback Loop
      • Conflict Data
  4. Cloud Data Services
    • IMHO, it is just an alternative name of parallel and distributed computing.
    • What is the difference? The difference is that ordinary people now can easily utilize large cluster resources easily via Map/Reduce paradigm.
  5. Mobile Applications and Virtual Worlds
    • Many Applications: But what is the problem?
    • Privacy in Data Sharing: a HOT research area. There are two sessions in VLDB 2009 and SIGMOD 2009.

The Note seems in a mess. The point is that web 2.0′s challenge.

posted under Vision | No Comments »

[SDM 08] A Spamicity Approach to Web Spam Detection

December16

1. Backgroud

Two types of spam: link spam and term spam.

  • Term Spam: Add many keywords into one page and usually make them invisiable but searchable to cheat TF/IF
  • Link Spam: Construct links to boost pagerank score.

The paper deals with both conditions, but I’m only interested in the latter one.

i.e. Given a page p, determine whether p is link spam.

Related Paper:

  • B. Zhou. Mining page farms and its application in link spam detection. Master’s thesis. School of Computing Science, Simon Fraser University, 2007.
  • B. Zhou and J. Pei. Sketching landscapes of page farms. In SDM’07.
  • Bin Zhou, Jian Pei and Zhaohui Tang. “A Spamicity Approach to Web Spam Detection”. In Proceedings of the 2008 SIAM International Conference on Data Mining (SDM’08), Atlanta, Georgia, USA, April 24-26, 2008.

2. Intuition

Give a page p, what is the minimal essential sub-structure to compute PR(p)?

It consists of two component:

  • Directed Paths to p. All vertices in the path forced a vertices set H
  • All out-links of H. (Why? To reserve the transition probability of pages in H)

But, the structure is extremely large and even amount to the whole web graph. Thus, we impose some constraints to the sub-structure, capture local environment of page p, and make its size affordable.

Two constrains:

  • path length is not greater than k
  • PR(p,G(H)) is at least θ portion of PR(p,G)

The smallest subgraph under these constraints is called (θ, k)-farm of page p.

Now, it comes another interesting problem, extract the page farm of a given page p.

3. Problem

3.1. Assumption

What is a link spam?

The assumption is that: For a link spam p, Farm(p) try to make PR(p,Farm(p)) as high as possible.

We define the Spamicity(p) as ratio of PR(p,Farm(p)) against PR(p,OptFarm(p)).

**Link spam is a page whose spamicity exceed a threshold **

Given vertices and edges size of Farm(p), we can compute PR(p,OptFarm(p)) easily. Thus, the problem is how to extract Farm(p) and compute PR(p,Farm(p)).

3.2. Input

Given

  • a page p.
  • Web Graph G.
  • Parameters of page farm – (θ,k).
  • A threshold t.

3.3. Output

Spamicity(p) ? t

3.4. Solution

We can extract the (θ, k)-Farm(p) and then compute the utility. Unfortunately, the first problem is NP-hard. (Knapsack problem can reduce to it.) We choose approximation algorithms to extract the page farm of p.

3.4.1. Solution 1

Local Greedy Search Method

  1. Extract the (θ, k)-Farm(p):
    • Add pages with the highest page contribution to p into the farm until achieve a θ-portion of PR(p,G)
    • page contribution can be effective computed by page contribution
  2. Compute the Utility

3.4.2. Solution 2

Monotone Greedy Search Method Observation: If we construct farm(p) using the local greedy method(i.e add pages in the descending page contribution), the spamicity decreases monotonically. We adjust local greedy search method. Stop if

  • spamicity is lower than the threshold
  • OR all k-distance vertices have been added

4. Comments

  • Good: Local greedy search method, no need to know the whole web graph structure.
  • No theoretical guarantee on the approximation algorithm and it seems hard to proof it.
  • Experiment is on a spam test collection – 8,415 pages (normal: 7472, spam: 767, borderline: 176[Discarded]). Maybe more efficient on large web graph.

{Seminar}2009.12.12. – Bursty Traffic over Bursty Links.[4]

December12

Bursty Traffic over Bursty Links
Muhammad H Alizai, Olaf Landsiedel, Jó A Bitsch Link, Stefan Goetz, and Klaus Wehrle (RWTH)
SenSys 2009
Bo YU

It is a system paper working on the routing protocols in WSN, the idea behind it is very simple but the experiments are thorough.
Protocols in WSN aim to establish a routing tree rooted at base station.
There exists some bursty links — links temporarily stable — in WSN.
We may use these links to transmit data with less hops. The paper builds a Short-Term Link Estimator.

The algorithm comprised 4 steps:
1. Reliability:
If one node overhears another node several times, we guess there are temporarily reliable link.
2. Feasibility:
The overhearing node queries the routing protocols and
find the overhearing packet could be transmitted by this node faster(i.e. ),
then the bursty link is feasible
3. Announcement:
It informs the sender-node that it can use the bursty link to transmit data faster
4. Unavaibility
Once failure in receiving/transmitting data, the link is announced unavailable.
We use ETX metric — the number of retransimission required to a successful transmission — to measure the quality of a link.

posted under Seminar | No Comments »

{Seminar}2009.12.12. – Distance-based Representative Skyline.[3]

December12

Distance-based Representative Skyline.
Yufei Tao (CUHK), Ling Ding (CUHK), Xuemin Lin (UNSW), Jian Pei (SFU)
ICDE 2009
Guohua LI

This paper maybe the hardest one in today’s seminar.
[Background]
Review of Skyline
Given a set D of multidimensional points skyline consists of the points that are not dominated by any other points.
X Dominates Y means X is smaller than Y in all dimensions.
In another word, skyline point implys that there at least one dimension of this point is smaller than others while other dimensions not.

What’s the k-representive skyline?
In this paper, it is defined by Representation Error — the maximum value of non-representive skyline point to its nearest skyline point.
Given Skyline S={s_1,s_2,…,s_n}
In 2-D space, the optimal substurue is obvious. We devide the skyline into two parts – one part consists of the first (k-1) representive skyline, another part consist of the kth skyline and its neighours.
The kth skyline
the other consists of the last one.
[Intuition]
For every non-representive skyline point, there is a nearby representive skyline point.
[Solution]
DP in 2D case.

posted under Seminar | No Comments »

{Seminar}2009.12.12. – Indexing Boolean Expressions.[2]

December12

Indexing Boolean Expressions
Steven Whang (Stanford Univ.), Chad Brower (Yahoo! Research), Jayavel Shanmugasundaram (Yahoo! Research), Serguei Vassilvitskii (Yahoo! Research), Erik Vee (Yahoo! Research), Ramana Yerneni (Yahoo! Research), Hector Garcia-Molina (Stanford Univ.)
VLDB 2009 [slides]
Xianming LIU

The problem is simple but practical. I like this type of problems.

[Problem]
Input: Given a assignment, a large set of Boolean Expressions(BE)
Output: All satisified BEs  under the assignment.

[Application]
Online Advertisig:
A BE represents an advertiser’s user targeting requirements.
An assignment of value to attributes represents the characteristics of a user visiting an online page.
Publish/Subscribe System:
A BE represents a subscription.
An assignment of value to attributes represents an event.

[Solution]
Build a inverted index [Atomic Expression, BE position list] to index all expressions.
- Conjunction: Merge sorted list
- DNF: find the
- CNF:
What is the key difference between the IR joining algorithms

posted under Seminar | No Comments »

{Seminar}2009.12.12. – Light-weight, Runtime Verification of Query Sources.[1]

December12

Light-weight, Runtime Verification of Query Sources.
Tingjian Ge (Brown U), Stan Zdonik (Brown U).
ICDE 2009
Lei YU

Background
Data owner outsources their data to the Network Storage System such as AWS.
The storage provider should ensure the underlying data not been modified by attackers
This paper address arbitrary modification of data incluing changes that may pass the DB.
In a world, it deals with modification in file system.

Q: What about Secure File system? What is the difference?
A: They are general purpose, not full using DB semantics? (No time to review, see paper)

“The basic idea is that we build one or more Merkle hash tree(s) for each sensitive DB object (e.g. specifica attributes and their associate index).
Then at query execution time, as we read data from the disk,
we verify that a minimum subset of “referenced” bytes from the sensitive DB objects is authentic,
thus providing that the data source and query result and both correct and complete
relateive to the initial “good state” from which the Merkle trees were built.”

The Merkle Hash Tree is a tree structure originally proposed for efficient authentication of equality queries in a DB sorted on the query attribute.
Every DB record corresponds to a distinct leaf node.
Each leaf node stores a hash value for the binary representation of its record.
The tree is constructed bottom-up, with each internal node storing the hash value of the concatenation of the hash value of its children.
With the root of the hash tree *and* the encryption key, we compute the signature of sensitive data.
*Query*
As a query runs, we need to verify it with Hash Tree by computing the signature again.
The server verifies the authenticity of the data it used from disk by computing hash values in a bottom-up manner, all the way up to the root.
Then we compared the signature with original signature to verified the authenticity.

*Update*
If we insert a data, we need to update the corresponding hash value in the tree.
The problem is if only part of the data modified, is it failure for all records?

*Organization*
Step 1. We split each page of a sensitive DB objects into several sub-blocks and construct a small in-page hash tree.
Step 2. All these in-pages hash trees make up a big hash tree.

SHA-1 is faster when the input is bigger message blocks

Q: Why we have to split a page?
A: Choose a fanout such that each hash operation is applied to a number of bytes that is close to the sub-block size

Q: How to split a page (i.e. what’s sub-block size of a page)?
A: In the experiment, try 3 size (1024, 512 and 256 bytes)
Tradeoff between time and space.
Small Size: quicker but bigger
Big Size: slower but smaller

Q: OK, then how to organize the in-page hash page ? (i.e. #fanout=?)
A: each hash operation applied to a number of bytes that is close to the sub-block size.

*Concurrency Control*
Two basic problems in CC, sync and reader/writer, merge some update operation to improver performance.
4 Phases in Writer
Synchronization Primitives and Reader/Writer
Write Transaction
Verification
Mark
Update
Helping

Q: Does the tree need to be balanced?

娱乐:
八卦一下 Merkle, 似乎很多数据结构都是以这个作为开头的
钻石时代

posted under Seminar | No Comments »
« Older Entries