4 R's of DB Research

Reading, Rithmetic, Research and wRiting

Reputation System and Sybil Attack

December 9

1. Reputation System
Reputation is the opinion the public holds of a person, a group of people, or an organization.
A reputation system computes and publishes reputation scores for a set of objects (e.g. products, services, goods or entities) within a community or domain, based on a collection of opinions that other entities hold about those objects.

For example, in the eBay system, users leave feedback for each other after each transaction.
These feedback ratings accumulate into a reputation score.
The reputation score then helps other people decide whom to trade with.
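
To make the accumulation step concrete, here is a minimal sketch. It assumes each piece of feedback is a rating of +1, 0 or -1 and that the score is a plain sum; the function name `reputation_scores` and the sample users are made up for illustration, and a real system is certainly more elaborate.

```python
from collections import defaultdict


def reputation_scores(feedback):
    """Sum the ratings each user has received.

    feedback: iterable of (rater, ratee, rating) tuples, where rating is
    +1 (positive), 0 (neutral) or -1 (negative). This rating scale is an
    assumption made for the sketch.
    """
    scores = defaultdict(int)
    for rater, ratee, rating in feedback:
        if rating not in (1, 0, -1):
            raise ValueError(f"unexpected rating: {rating}")
        scores[ratee] += rating
    return dict(scores)


# Hypothetical feedback left after four transactions.
feedback = [
    ("alice", "bob", 1),
    ("carol", "bob", 1),
    ("dave", "bob", -1),
    ("bob", "alice", 1),
]
scores = reputation_scores(feedback)
```

Note that nothing in this sum checks who the raters are, which is exactly the weakness the next section is about.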

A typical reputation system includes:

Because reputation influences users' decisions, fraudsters try to gain extra reputation by creating a large number of pseudonymous entities.
There is a piece of jargon for this attack: the Sybil attack.

2. Sybil Attack
The name “Sybil” alludes to dissociative identity disorder.
It originates from the 1973 book Sybil, about Shirley Ardell Mason’s treatment for dissociative identity disorder.
The Sybil attack is an attack on reputation systems, for example in P2P networks. The attacker registers many times under multiple identities and thereby controls a large enough share of the space to capture particular traffic.

The key to preventing a Sybil attack is validation that enforces a one-to-one mapping between users and accounts.
Techniques such as a certification authority or weakly secure IDs can be used.
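
The one-to-one mapping can be sketched as follows. This is only an illustration: `AccountRegistry` and the `identity_id` strings are hypothetical, and the sketch simply assumes that some authority has already verified the identity behind each registration request.

```python
class AccountRegistry:
    """Allows at most one account per verified identity (hypothetical model)."""

    def __init__(self):
        self._account_by_identity = {}

    def register(self, identity_id, account_name):
        # identity_id stands in for whatever a certification authority
        # (or a weaker ID such as an IP address) would attest to.
        if identity_id in self._account_by_identity:
            return False  # this identity already owns an account
        self._account_by_identity[identity_id] = account_name
        return True

    def accounts(self):
        return set(self._account_by_identity.values())


registry = AccountRegistry()
results = [
    registry.register("id-001", "honest_user"),
    registry.register("id-002", "attacker_main"),
    registry.register("id-002", "attacker_sock1"),  # rejected: same identity
    registry.register("id-002", "attacker_sock2"),  # rejected: same identity
]
```

The hard part in practice is the assumption itself: with weak IDs the attacker can often obtain many of them, which is why detection is still needed.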

3. Wait a minute, what’s the connection between this and your research?
The Sybil attack, or some variant of it, exists in many different scenarios under different names.
There are two methods to deal with the attack:
- Prevention. Validate the user’s identity with a certification authority or weakly secure IDs (e.g. IP addresses) to prevent a Sybil attack.
- Detection. Once a Sybil attack has occurred, how can we detect it?
Attack detection is what I really want to work on.

4. Related Work and Techniques

* Spam Detection in Web Graph
The web graph is a reputation system in which a link represents support for a web page.
In the web search area, there are many mature techniques for defining the reputation score (e.g. PageRank and HITS).
Reading List:
Jian Pei, Bin Zhou, Zhaohui Tang, Hai Huang. “Data Mining Techniques for Web Spam Detection”. In Proceedings of the 12th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’08), Osaka, Japan, May 20-23, 2008.
* Data Conflict Resolution
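
For the web-graph item above, here is a rough sketch of how a link-based reputation score such as PageRank can be computed by power iteration. The tiny graph, the damping factor of 0.85 and the fixed iteration count are assumptions chosen for illustration; this is not the method of any paper in the reading list.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links: dict mapping each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets the same small base share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among its out-links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page without out-links spreads its rank over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked to by every other page.
web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(web)
```

In this toy graph "c" ends up with the highest score and "d", which nobody links to, with the lowest. A spammer who controls many pages can add links in the same way "d" does, which is the web-graph analogue of a Sybil attack.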

References:
[1] “Sybil”, “Sybil attack” and “Reputation system” entries in Wikipedia.
[2] Security and Trust part of Joseph M. Hellerstein’s slide Tutorial: Architectures and Algorithms for Internet-Scale (p2p) Data Management. VLDB 2004.

——–

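What follows is trivia (searching for this took far too much time; from now on I should restrain my curiosity, or at least postpone it).
Sybil has another meaning: a prophetess, usually spelled Sibyl, from Greek. The Divination teacher in Harry Potter is named Sybill Trelawney.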


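As for her legend: the Sibyl was a priestess at the temple of Apollo in Greek mythology who, favored by the sun god, gained the ability to foretell the future.
She was granted as many years of life as there are grains of sand, which is what she asked Apollo for, but she forgot to ask for youth, so that in the end her only wish was to die.
(The epigraph of T. S. Eliot’s long poem The Waste Land: “When the boys asked the Sibyl what she wanted, she answered: ‘I want to die.’”)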

I seem to have read that last part in some fantasy novel.

{Outlier}Alpha – Reading Plan

December 9

My master’s thesis work is on anomaly detection in large dynamic graphs. Thus, I plan to read some selected papers on this classic topic.

Outliers, or anomalies, are patterns in data that do not conform to a well-defined notion of normal behavior.

[Chandola et al. 2009] provides a brief overview of this area and a taxonomy of existing techniques.
The plan is to read the classic or newest papers on this topic from the DB community, following that taxonomy.

Techniques include:

* Classification
A two-class classification problem, using classical models such as neural networks, SVMs and Bayesian networks.
Challenges: feature selection and obtaining training data.

* Nearest-neighbor (NN) approach
Outliers are objects whose neighborhoods are sparse, i.e. they have few near neighbors.
Selected Readings:
- Yufei Tao, Xiaokui Xiao, and Shuigeng Zhou. Mining Distance-based Outliers from Large Databases in Any Metric Space. Proceedings of the 12th ACM SIGKDD International Conference On Knowledge Discovery and Data Mining (SIGKDD), pages 394-403, 2006.

* Clustering
Outliers are objects that are not located in any cluster.
* Statistical
* Information Theoretic
* Spectral

The NN approach and the clustering approach are the techniques the DB community most commonly works on, so I will focus on them in particular.
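
To make the NN idea concrete, here is a naive sketch of distance-based outlier detection: a point is reported if at most `max_neighbors` other points lie within `radius` of it. The function name, the parameters and the sample points are illustrative assumptions, and the quadratic scan is exactly what papers on large databases try to avoid; it is not the algorithm of the selected reading above.

```python
import math


def distance_based_outliers(points, radius, max_neighbors):
    """Return the points with at most max_neighbors other points within radius."""
    outliers = []
    for i, p in enumerate(points):
        # Count the other points that fall inside the ball around p.
        neighbors = sum(
            1
            for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= radius
        )
        if neighbors <= max_neighbors:
            outliers.append(p)
    return outliers


# Hypothetical 2-D data: four points close together and one far away.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.15, 0.15), (5.0, 5.0)]
outliers = distance_based_outliers(points, radius=1.0, max_neighbors=0)
```

Here only (5.0, 5.0) is reported, since it has no neighbor within distance 1.0. Any metric could replace `math.dist`, because the definition only needs pairwise distances.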

References:
Varun Chandola, Arindam Banerjee, and Vipin Kumar, “Anomaly Detection: A Survey”, ACM Computing Surveys, Vol. 41(3), Article 15, July 2009. [Slides]

Notes on Dr. Haixun Wang’s talk

December 8

These are some notes from Dr. Haixun Wang’s talk on how to conduct quality research and write good papers, given in the CCF DBTC mentoring program at NDBC 2009.

- What is research?
  Research is to identify a new problem, generalize your solution so it can solve a class of problems and then write a paper.
- How to do research?
  Research = Write + Rewrite
- How to find a topic?
  <Mundane? Try to generalize.>
  Four methods
  – Patchwork Extension
  – Matchmaking
  – Research on Demand
  – Use your creativity
- Writing Techniques
  – Introduction: Start from a good story and be specific
  – Motivation: “A picture is worth a thousand words”, if you can.
  – Example: As simple as possible (Occam’s razor)
  – Related Work:
    – Constraints in your methods (Good! But it’s often ignored by people!!!)
    – Drawbacks
- An Example (A SIGMOD 02 Paper)

The Beginning

December 8

“… the only difference between a FA and a TM is that the TM, unlike the FA, has paper and pencil.
Think about it. It tells us something about the power of writing.
Without writing, you are reduced to a finite automaton.
With writing, you have the extraordinary power of a Turing machine.”

– Manuel Blum, Advice to a Beginning Graduate Student

Recently, I decided to start this English blog to record what I read and think about in DB research. I wish I had started it earlier, but it seems not too late now :-).

I have always believed that writing is a powerful tool of thought. In fact, the power of writing has been emphasized again and again by many people. Perhaps the most interesting statement is the one above by Manuel Blum, a professor at CMU who received the Turing Award for his contributions to complexity theory.

Besides, in a talk given in the DBTC mentoring program at NDBC 2009, Dr. Haixun Wang asked: “Can creativity be trained? How do you train your creativity?”
The answer is “write, write and write”,
and
Research = Write + Rewrite
In the process of writing down what you have read or thought, you understand your problem better and better, and may eventually find more general or more interesting problems.
In addition, it opens the way to dialogue with others.

Just as the best way to train a programmer is to code, the best way to train a researcher is to write.
Write what you read as you read it. Write what you think before you implement it.
That is the mission of this blog.


Hello world!

December 4

Welcome to 撞击思想. This is your first blog post; start your 撞击思想 journey!
