4 R's of DB Research

Reading, Rithmetic, Research and wRiting

[SDM 08] A Spamicity Approach to Web Spam Detection

December 16

1. Background

Two types of spam: link spam and term spam.

  • Term Spam: Add many keywords to a page, usually invisible to readers but still searchable, to cheat TF-IDF scoring.
  • Link Spam: Construct links to boost the PageRank score.

The paper deals with both kinds, but I’m only interested in the latter.

i.e. Given a page p, determine whether p is link spam.

Related Papers:

  • B. Zhou. Mining page farms and its application in link spam detection. Master’s thesis. School of Computing Science, Simon Fraser University, 2007.
  • B. Zhou and J. Pei. Sketching landscapes of page farms. In SDM’07.
  • Bin Zhou, Jian Pei and Zhaohui Tang. “A Spamicity Approach to Web Spam Detection”. In Proceedings of the 2008 SIAM International Conference on Data Mining (SDM’08), Atlanta, Georgia, USA, April 24-26, 2008.

2. Intuition

Given a page p, what is the minimal essential sub-structure needed to compute PR(p)?

It consists of two components:

  • Directed paths to p. All vertices on these paths form a vertex set H.
  • All out-links of H. (Why? To preserve the transition probabilities of the pages in H.)

But this structure is extremely large and may even amount to the whole web graph. Thus, we impose some constraints on the sub-structure so that it captures the local environment of page p and its size stays affordable.

Two constraints:

  • The path length is not greater than k.
  • PR(p,G(H)) is at least a θ portion of PR(p,G).

The smallest subgraph under these constraints is called the (θ, k)-farm of page p.
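The θ constraint can be checked on a toy graph. The sketch below is only an illustration under assumptions that are not taken from the paper: PageRank is left unnormalized, the teleport term stays at (1 − d)/N with N the size of the full graph, and out-degrees are those of the full graph even when only the farm is kept. The names pagerank, farm_ratio and the five-page graph are hypothetical.

```python
def pagerank(graph, keep=None, damping=0.85, iters=50):
    """Unnormalized PageRank on {page: [out-links]}.

    If `keep` is given, only those pages pass rank along, but each page
    still divides its rank by its out-degree in the full graph
    (assumption: this is how the out-links of H are "preserved").
    """
    n_total = len(graph)
    nodes = set(graph) if keep is None else set(keep)
    base = (1.0 - damping) / n_total
    rank = {v: base for v in nodes}
    for _ in range(iters):
        nxt = {v: base for v in nodes}
        for v in nodes:
            outs = graph[v]
            if not outs:
                continue
            share = damping * rank[v] / len(outs)
            for w in outs:
                if w in nodes:
                    nxt[w] += share
        rank = nxt
    return rank


def farm_ratio(graph, p, farm):
    """Return PR(p, G(H)) / PR(p, G) for a candidate farm H (p is added automatically)."""
    full = pagerank(graph)[p]
    local = pagerank(graph, keep=set(farm) | {p})[p]
    return local / full


# Hypothetical web graph: c -> a -> p, b -> p, b -> x.
graph = {"p": [], "a": ["p"], "b": ["p", "x"], "c": ["a"], "x": []}

ratio_small = farm_ratio(graph, "p", {"a"})        # only the direct in-neighbor a
ratio_larger = farm_ratio(graph, "p", {"a", "c"})  # a plus its own supporter c
```

With θ = 0.8, the candidate {a} falls short while {a, c} satisfies the constraint, which shows why vertices further back on paths to p can matter.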

This raises another interesting problem: extracting the page farm of a given page p.

3. Problem

3.1. Assumption

What is link spam?

The assumption is that, for a link spam page p, Farm(p) tries to make PR(p,Farm(p)) as high as possible.

We define Spamicity(p) as the ratio of PR(p,Farm(p)) to PR(p,OptFarm(p)).

**Link spam is a page whose spamicity exceeds a threshold.**

Given the numbers of vertices and edges of Farm(p), we can compute PR(p,OptFarm(p)) easily. Thus, the problem is how to extract Farm(p) and compute PR(p,Farm(p)).
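Written as a formula, with t the threshold, and assuming OptFarm(p) denotes the best possible farm with the same numbers of vertices and edges as Farm(p) (my reading of the notes above, not a quotation from the paper):

```latex
\[
\mathrm{Spamicity}(p) \;=\; \frac{PR\bigl(p,\ \mathrm{Farm}(p)\bigr)}{PR\bigl(p,\ \mathrm{OptFarm}(p)\bigr)},
\qquad
p \text{ is link spam} \iff \mathrm{Spamicity}(p) > t .
\]
```

Under that assumption the ratio lies in (0, 1], and a value close to 1 means the farm is almost as effective at boosting p as any farm of that size could be.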

3.2. Input

Given

  • a page p.
  • Web Graph G.
  • Parameters of page farm – (θ,k).
  • A threshold t.

3.3. Output

Whether Spamicity(p) > t, i.e. whether p is link spam.

3.4. Solution

We can extract the (θ, k)-Farm(p) and then compute the utility. Unfortunately, extracting the farm is NP-hard (the knapsack problem can be reduced to it), so we use approximation algorithms to extract the page farm of p.

3.4.1. Solution 1

Local Greedy Search Method

  1. Extract the (θ, k)-Farm(p):
    • Add the pages with the highest page contribution to p into the farm until it achieves a θ portion of PR(p,G).
    • Page contribution can be computed efficiently from path contribution.
  2. Compute the Utility
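Step 1 can be sketched as a short loop. This is a minimal illustration, not the paper's algorithm: it assumes the candidate pages are already restricted to those within distance k of p, that their page contributions are precomputed, and that a callback pr_of_farm returns PR(p, G(farm)). The function name, the callback and all the numbers are hypothetical.

```python
def local_greedy_farm(contribution, pr_of_farm, pr_full, theta):
    """Add pages in descending contribution until PR(p, farm) >= theta * PR(p, G).

    contribution: {page: contribution to p}, candidates within distance k (assumed)
    pr_of_farm:   callable(list of pages) -> PR of p using only that farm (assumed)
    """
    farm = []
    for page in sorted(contribution, key=contribution.get, reverse=True):
        if pr_of_farm(farm) >= theta * pr_full:
            break  # the theta portion is reached, stop growing the farm
        farm.append(page)
    return farm


# Toy setup with made-up numbers; contributions are treated as simply additive,
# which is a simplification for the example only.
toy_contribution = {"a": 0.03, "b": 0.01, "c": 0.02}
toy_base = 0.03   # PR(p) with an empty farm
toy_pr_full = 0.09


def toy_pr_of_farm(farm):
    return toy_base + sum(toy_contribution[v] for v in farm)


toy_farm = local_greedy_farm(toy_contribution, toy_pr_of_farm, toy_pr_full, theta=0.8)
```

Here the loop takes a, then c, and stops before b because the farm already accounts for at least 80% of PR(p,G).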

3.4.2. Solution 2

Monotone Greedy Search Method

Observation: If we construct Farm(p) using the local greedy method (i.e. add pages in descending order of page contribution), the spamicity decreases monotonically. We adjust the local greedy search method accordingly. Stop if

  • spamicity is lower than the threshold
  • OR all k-distance vertices have been added
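The two stopping rules can be sketched as an early-exit loop. This is an illustration under assumptions: the candidates are given already sorted by descending page contribution, spamicity_of is a hypothetical callback that evaluates the spamicity of the farm built so far, and a page whose spamicity never drops below the threshold is reported as spam (my inference from the stopping rules, not something stated explicitly above).

```python
def monotone_greedy_is_spam(ordered_candidates, spamicity_of, threshold):
    """Return (is_spam, farm) using early termination.

    ordered_candidates: pages within distance k, in descending page contribution (assumed)
    spamicity_of:       callable(list of pages) -> spamicity of the current farm (assumed)
    """
    farm = []
    for page in ordered_candidates:
        farm.append(page)
        if spamicity_of(farm) < threshold:
            # Spamicity only decreases from here on, so it cannot recover.
            return False, farm
    # Every k-distance vertex was added and spamicity stayed at or above the threshold.
    return True, farm


# Toy spamicity that decreases as the farm grows (made-up, for illustration only).
def toy_spamicity(farm):
    return 1.0 / len(farm)


toy_pages = ["a", "b", "c", "d", "e"]
early_stop = monotone_greedy_is_spam(toy_pages, toy_spamicity, threshold=0.4)
full_scan = monotone_greedy_is_spam(toy_pages, toy_spamicity, threshold=0.1)
```

In the first call the loop stops after three pages; in the second it has to add all five before it can answer.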

4. Comments

  • Good: The local greedy search method does not need to know the whole web graph structure.
  • There is no theoretical guarantee for the approximation algorithm, and one seems hard to prove.
  • The experiment uses a spam test collection of 8,415 pages (normal: 7,472; spam: 767; borderline: 176, discarded). The method may be more efficient on a large web graph.

Reputation System and Sybil Attack

December 9

1. Reputation System
Reputation is the opinion of the public toward a person, a group of people, or an organization.
A reputation system computes and publishes reputation scores for a set of objects (e.g. products, services, goods or entities) within a community or domain, based on a collection of opinions that other entities hold about the objects.

For example, on eBay, users leave feedback for each other after each transaction.
These feedback scores accumulate into the reputation score.
The reputation score then helps people make decisions.

Typical reputation systems include:

Because reputation scores influence users’ decisions, fraudsters try to gain extra reputation by creating a large number of pseudonymous entities.
There is a term for this kind of attack: the Sybil attack.
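A small sketch of why pseudonymous identities are so effective against a score that simply accumulates feedback. The sum-of-feedback scoring below is a deliberate simplification, not eBay's actual formula, and all names and data are made up.

```python
from collections import defaultdict


def reputation_scores(feedback):
    """Sum feedback per target; feedback is a list of (rater, target, score) tuples."""
    scores = defaultdict(int)
    for _rater, target, score in feedback:
        scores[target] += score
    return dict(scores)


# Three genuine transactions: two positive, one negative.
honest_feedback = [
    ("alice", "seller", +1),
    ("bob", "seller", -1),
    ("carol", "seller", +1),
]

# The seller registers 20 fake accounts, each leaving positive feedback.
sybil_feedback = [(f"fake{i}", "seller", +1) for i in range(20)]

honest_score = reputation_scores(honest_feedback)["seller"]
attacked_score = reputation_scores(honest_feedback + sybil_feedback)["seller"]
```

The score treats every rater as an independent person, so each extra identity is worth as much as a real customer.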

2. Sybil Attack
“Sybil” refers to dissociative identity disorder.
The name originates from Sybil, a 1973 book about Shirley Ardell Mason’s treatment for dissociative identity disorder.
The Sybil attack targets reputation systems such as those in P2P networks. The attacker registers many times under multiple identities and then controls enough of the space to capture particular traffic.

The key to preventing a Sybil attack is validation that ensures a one-to-one mapping between users and accounts.
Techniques such as a certification authority and weakly secure IDs can be used.

3. Wait a minute, what’s the connection between this and your research?
The Sybil attack, or some variant of it, exists in many different scenarios under different names.
There are two methods to deal with the attack:
- Prevention. Validate users’ identities through a certification authority or weakly secure IDs (e.g. IP addresses) to prevent Sybil attacks.
- Detection. Once a Sybil attack has occurred, how can we detect it?
Attack detection is what I really want to work on.

4. Related Work and their techniques

* Spam Detection in Web Graph
The web graph is a reputation system in which a link represents support for a web page.
In web search, there are many mature techniques for defining the reputation score (e.g. PageRank and HITS).
Reading List:
Jian Pei, Bin Zhou, Zhaohui Tang, Hai Huang. “Data Mining Techniques for Web Spam Detection”. In Proceedings of the 12th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’08), Osaka, Japan, May 20-23, 2008.
* Data Conflict Resolution

References:
[1] “Sybil”, “Sybil attack”, and “Reputation system” in Wikipedia.
[2] The Security and Trust part of Joseph M. Hellerstein’s tutorial slides, “Architectures and Algorithms for Internet-Scale (P2P) Data Management”. VLDB 2004.

——–

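What follows is trivia (looking this up took too much time; in the future I need to restrain my curiosity, or at least postpone it).
Sybil has another meaning: a prophetess, usually spelled Sibyl, from Greek. The Divination teacher in Harry Potter is named Sibyl Trelawney.


According to legend, Sibyl was a priestess at the temple of Apollo in Greek mythology who, favored by the sun god, gained the ability to foretell the future.
She was granted as many years of life as there are grains of sand, which is what she had asked Apollo for, but she forgot to ask for youth, so in the end her only wish was to die.
(From the epigraph of T. S. Eliot’s long poem The Waste Land: “The children asked the Sibyl what she wanted, and the Sibyl answered: ‘I want to die.’”)

I seem to have read that last part in some fantasy novel.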

{Outlier}Alpha – Reading Plan

December 9

My master’s thesis work is on anomaly detection in large dynamic graphs. Thus, I plan to read some selected papers on this classic topic.

Outliers or anomalies are patterns in data that do not conform to a well-defined notion of normal behavior.

[Chandola et al. 2009] provides a brief overview of this area and a taxonomy of existing techniques.
The plan is to read the classic or newest papers on this topic from the DB community, following that taxonomy.

Techniques include:

* Classification
A two-class classification problem, using classical models such as neural networks, SVMs and Bayesian networks.
Challenges: feature selection and training data.

* Nearest-neighbor (NN) approach
Outliers are objects whose near neighbors are sparse.
Selected Readings:
- Yufei Tao, Xiaokui Xiao, and Shuigeng Zhou. Mining Distance-based Outliers from Large Databases in Any Metric Space. Proceedings of the 12th ACM SIGKDD International Conference On Knowledge Discovery and Data Mining (SIGKDD), pages 394-403, 2006.

* Cluster
Outliers are objects not located in any cluster.
* Statistical
* Information Theoretic
* Spectral

The NN and cluster approaches are the techniques the DB community most commonly works on, so I will focus on them in particular.
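To make the NN idea concrete, here is a brute-force sketch that scores each object by the distance to its k-th nearest neighbor, so objects in sparse neighborhoods get high scores. It is a generic illustration rather than the algorithm of any paper listed here; the function name and the sample points are made up, and the O(n²) scan would not scale to large databases.

```python
import math


def knn_outlier_scores(points, k):
    """Score each point by the Euclidean distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Four points forming a tight square plus one far-away point.
sample_points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
sample_scores = knn_outlier_scores(sample_points, k=2)

# The point with the largest score is the strongest outlier candidate.
top_index = max(range(len(sample_points)), key=sample_scores.__getitem__)
top_outlier = sample_points[top_index]
```

The far-away point gets by far the largest score because even its second-nearest neighbor is more than ten units away.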

References:
Varun Chandola, Arindam Banerjee, and Vipin Kumar. “Anomaly Detection: A Survey”. ACM Computing Surveys, Vol. 41(3), Article 15, July 2009.