CS262A Project Proposal
Approximate Text Addressing and Spam Filtering on P2P Systems
Feng Zhou(zf@cs), Li Zhuang(zl@cs)
10/10/2002 11:32AM
Motivation
Peer-to-peer overlay infrastructures such as Tapestry, Chord, Pastry and CAN have demonstrated the benefits of scalable, wide-area lookup services for Internet applications. These services are built on top of distributed hash table (DHT) algorithms, which efficiently map a global name to the participating node that offers an object with that name. We propose to build an additional layer on top of DHTs, called the Approximate Text Addressing (ATA) service. ATA maps all approximately identical texts to a single object somewhere in the peer-to-peer network.
The observation behind ATA is that text documents about a particular real-world
object or topic are often copied verbatim, or with only small modifications, across a wide-area network.
Examples include junk emails received by different people, news articles on different
news sites generated from a single company press release, and plagiarized scientific
papers. Many applications can be built if we can identify these similarities:
an anti-spam system that identifies junk email, an automatic news service that groups
similar reports and lets users discuss a specific news topic globally, or a copy-detection
system that spots plagiarized articles or even plagiarized computer programs.
Systems have been built for approximate text copy detection. However, most
of them are centralized systems with limited scalability.
Moreover, the nature of these applications is such that the more people they serve,
the more useful they are to each user.
Building these applications in a peer-to-peer way is therefore desirable because of
potentially better scalability and availability. It can also
reduce network traffic, because DHTs often exploit locality and cache
data. To make such applications truly scalable, an efficient approximate text
addressing service is vital. Developing this general-purpose service is
the main part of our project.
We will also build an application on top of it: a scalable
collaborative junk email filtering service, in order to prove ATA's usefulness
and discover potential problems.
Approximate Text Addressing in DHTs
DHT-based p2p infrastructures provide name-based routing and directory services.
ATA, in contrast, provides routing and directory services based on approximate text.
It locates an object according to a text document, so that approximately identical
text documents map to the same object.
With the help of a DHT, the intuitive process is
Document --(1)--> global document ID --(2)--> object/record
The second mapping is done by the DHT, while the first one is done by the additional layer.
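To make the layering concrete, here is a minimal sketch of the two-step resolution in Python. The ata and dht objects and their methods are illustrative assumptions, not a fixed API:

    def fetch_record(ata, dht, document):
        """Resolve a text document to its shared record in two steps."""
        doc_id = ata.document_id(document)  # step (1): approximate text -> global document ID
        return dht.lookup(doc_id)           # step (2): exact, name-based lookup in the DHT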
However, because we are providing approximate text addressing, we have to find a
way to map all similar documents to a single document ID. Although we can normalize
the format of a document by removing extra spaces or mark-up, this cannot handle
small literal changes such as word reordering and substitution,
so a standard hash function is not suitable. The document IDs do not actually have
to be identical, because some DHTs provide proximity routing.
If we can map similar text documents to similar summary codes,
proximity routing can route them to the same node.
There is at least one existing summary code with this property,
namely the Nilsimsa code [1]. The problem with
Nilsimsa is that the codes of similar documents can differ at any position within the
256 bits, whereas proximity routing in DHTs can only tolerate IDs that differ near the end.
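To illustrate the kind of summary code we mean, here is a toy Nilsimsa-style digest, not the actual Nilsimsa algorithm: similar texts share most trigrams and so get digests with small Hamming distance, but the differing bits can fall at any of the 256 positions, which is exactly the mismatch with proximity routing described above.

    import hashlib

    def trigram_digest(text, bits=256):
        """Toy Nilsimsa-style digest: hash character trigrams into `bits`
        accumulators, then set bit i if accumulator i exceeds the median count."""
        acc = [0] * bits
        for i in range(len(text) - 2):
            h = hashlib.md5(text[i:i + 3].encode()).digest()
            acc[int.from_bytes(h[:4], "big") % bits] += 1
        median = sorted(acc)[bits // 2]
        return sum(1 << i for i, c in enumerate(acc) if c > median)

    def hamming_distance(a, b):
        """Number of bit positions at which two digests differ."""
        return bin(a ^ b).count("1")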
We will work on ways to solve this problem. First, some variant of the Nilsimsa code may
meet the requirement of proximity routing. Another
possibility is to transform the text stream into the frequency domain using a
DFT or another transformation. The observation is that a small modification will
only change the high-frequency components, so we can encode the components
in a way that similar documents have similar codes that differ only near the end.
The most promising solution seems to be block text fingerprints. This
technique has been used in text watermarking and copy/plagiarism detection [2].
The basic idea is to generate a number of checksums of fixed-length blocks, say 10,
i.e. fingerprints, for each document. If all or most of the fingerprints of two documents match,
the documents are probably identical or highly similar. Using text fingerprints,
the process for fetching the data record corresponding to a given document
(communicating with an object is the same) is:
- Calculate the document's fingerprints.
- Search the DHT for document IDs matching each fingerprint.
- The document ID that appears in more than a certain fraction of the search results is the one we are looking for.
- Fetch the record by document ID through the DHT.
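Below is a rough sketch of this lookup, assuming blocks are taken at fixed offsets and an illustrative dht.lookup(key) call that returns the document ID previously published under a fingerprint (or None). A real implementation would likely choose block boundaries more carefully, as in [2]:

    import hashlib

    BLOCK_SIZE = 64     # assumed block length in bytes
    NUM_BLOCKS = 10     # fingerprints per document, as above
    MATCH_RATIO = 0.5   # assumed fraction of fingerprints that must agree

    def fingerprints(text):
        """Checksum the first NUM_BLOCKS fixed-length blocks of the document."""
        data = text.encode()
        return [hashlib.sha1(data[i:i + BLOCK_SIZE]).hexdigest()
                for i in range(0, min(len(data), NUM_BLOCKS * BLOCK_SIZE), BLOCK_SIZE)]

    def lookup_document_id(dht, text):
        """Return the document ID voted for by enough of the fingerprints."""
        fps = fingerprints(text)
        votes = {}
        for fp in fps:
            doc_id = dht.lookup(fp)          # search the DHT for this fingerprint
            if doc_id is not None:
                votes[doc_id] = votes.get(doc_id, 0) + 1
        for doc_id, count in votes.items():
            if count >= MATCH_RATIO * len(fps):
                return doc_id                # agreed-upon ID; fetch its record next
        return None                          # no sufficiently similar document found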
Junk Email Filtering with ATA
The goal is to build a service that helps people identify junk email, or SPAM.
Only people can really identify SPAM, so we want a system that lets
users collaborate with each other to fight it. Junk emails are
identified by users through their mail user agents; each tagged email is then
mapped to a record in the p2p network using ATA.
An incoming email can then be classified as either useful or SPAM by
looking at its record, which contains information such as how many people
claim it to be junk.
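A sketch of how a mail user agent might use such a record follows. The ata and dht interfaces and the threshold are assumptions for illustration only:

    SPAM_THRESHOLD = 10   # assumed number of complaints before a message counts as SPAM

    def report_spam(ata, dht, email_text):
        """Called when a user tags a message as junk in the mail user agent."""
        doc_id = ata.lookup_or_publish(email_text)      # similar emails map to one ID
        record = dht.fetch(doc_id) or {"spam_votes": 0}
        record["spam_votes"] += 1
        dht.store(doc_id, record)

    def is_probably_spam(ata, dht, email_text):
        """Check an incoming message against the shared record."""
        doc_id = ata.lookup(email_text)
        if doc_id is None:
            return False
        record = dht.fetch(doc_id)
        return record is not None and record["spam_votes"] >= SPAM_THRESHOLD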
We are building the system on ATA/DHT because the effectiveness of the service depends on the number of users.
Thanks to the scalability and availability of DHTs, a global-scale anti-spam network can be built this way,
without a central bottleneck or single point of failure. Another potential advantage
of the peer-to-peer approach is reduced network traffic and latency. SPAM tends to exhibit locality,
so a node in Europe probably does not need to query a node in North America about a given
junk email, because records of popular junk emails will in most cases be cached
nearby.
Evaluation Approach
We will do both correctness evaluation and performance evaluation. We evaluate
ATA itself and the junk email filter together, in the context of SPAM filtering.
Simulation is our main method.
For correctness evaluation, there are several variables:
- Dataset. We can use real emails tagged by us as SPAM or normal, or synthesized
emails with a given SPAM distribution. An important open question here is what
fraction of real-world email is SPAM.
- User behavior. Input will be synthesized based on several variables, such as whether
users tag SPAM in time, whether they tag normal emails as SPAM, and whether malicious
users exist (see the sketch after this list).
We can also take direct measurements once we have a running system.
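A small sketch of how synthetic user behavior might be generated for the simulator; the parameter names and defaults are illustrative assumptions for the variables above:

    import random

    def synthesize_user_actions(emails, num_users=1000, miss_prob=0.1,
                                mislabel_prob=0.01, malicious_frac=0.0, seed=0):
        """emails is a list of (email_id, is_spam) pairs. Returns a list of
        (user, email_id, tagged_as_spam) events; malicious users invert their
        votes, others occasionally mislabel or fail to tag a message in time."""
        rng = random.Random(seed)
        events = []
        for user in range(num_users):
            malicious = rng.random() < malicious_frac
            for email_id, is_spam in emails:
                if rng.random() < miss_prob:
                    continue                  # user never tags this message in time
                vote = not is_spam if malicious or rng.random() < mislabel_prob else is_spam
                events.append((user, email_id, vote))
        return events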
For performance evaluation, we need to add a few variables: the caching and replication
policies employed by nodes in the network, and the frequency of nodes joining and leaving.
The metrics to measure include at least throughput scalability, network traffic, and query latency.
References
[1] Nilsimsa code. Available at: http://lexx.shinn.net/cmeclax/nilsimsa.html
[2] U. Manber, "Finding Similar Files in a Large File System," USENIX Winter
1994 Technical Conference, San Francisco, January 1994, pp. 1-10.