[SDM 08] A Spamicity Approach to Web Spam Detection
1. Backgroud
Two types of spam: link spam and term spam.
- Term Spam: Add many keywords into one page and usually make them invisiable but searchable to cheat TF/IF
- Link Spam: Construct links to boost pagerank score.
The paper deals with both conditions, but I’m only interested in the latter one.
i.e. Given a page p, determine whether p is link spam.
Related Paper:
- B. Zhou. Mining page farms and its application in link spam detection. Master’s thesis. School of Computing Science, Simon Fraser University, 2007.
- B. Zhou and J. Pei. Sketching landscapes of page farms. In SDM’07.
- Bin Zhou, Jian Pei and Zhaohui Tang. “A Spamicity Approach to Web Spam Detection”. In Proceedings of the 2008 SIAM International Conference on Data Mining (SDM’08), Atlanta, Georgia, USA, April 24-26, 2008.
2. Intuition
Give a page p, what is the minimal essential sub-structure to compute PR(p)?
It consists of two component:
- Directed Paths to p. All vertices in the path forced a vertices set H
- All out-links of H. (Why? To reserve the transition probability of pages in H)
But, the structure is extremely large and even amount to the whole web graph. Thus, we impose some constraints to the sub-structure, capture local environment of page p, and make its size affordable.
Two constrains:
- path length is not greater than k
- PR(p,G(H)) is at least θ portion of PR(p,G)
The smallest subgraph under these constraints is called (θ, k)-farm of page p.
Now, it comes another interesting problem, extract the page farm of a given page p.
3. Problem
3.1. Assumption
What is a link spam?
The assumption is that: For a link spam p, Farm(p) try to make PR(p,Farm(p)) as high as possible.
We define the Spamicity(p) as ratio of PR(p,Farm(p)) against PR(p,OptFarm(p)).
**Link spam is a page whose spamicity exceed a threshold **
Given vertices and edges size of Farm(p), we can compute PR(p,OptFarm(p)) easily. Thus, the problem is how to extract Farm(p) and compute PR(p,Farm(p)).
3.2. Input
Given
- a page p.
- Web Graph G.
- Parameters of page farm – (θ,k).
- A threshold t.
3.3. Output
Spamicity(p) ? t
3.4. Solution
We can extract the (θ, k)-Farm(p) and then compute the utility. Unfortunately, the first problem is NP-hard. (Knapsack problem can reduce to it.) We choose approximation algorithms to extract the page farm of p.
3.4.1. Solution 1
Local Greedy Search Method
- Extract the (θ, k)-Farm(p):
- Add pages with the highest page contribution to p into the farm until achieve a θ-portion of PR(p,G)
- page contribution can be effective computed by page contribution
- Compute the Utility
3.4.2. Solution 2
Monotone Greedy Search Method Observation: If we construct farm(p) using the local greedy method(i.e add pages in the descending page contribution), the spamicity decreases monotonically. We adjust local greedy search method. Stop if
- spamicity is lower than the threshold
- OR all k-distance vertices have been added
4. Comments
- Good: Local greedy search method, no need to know the whole web graph structure.
- No theoretical guarantee on the approximation algorithm and it seems hard to proof it.
- Experiment is on a spam test collection – 8,415 pages (normal: 7472, spam: 767, borderline: 176[Discarded]). Maybe more efficient on large web graph.

