Monitoring Distributed Data Streams Through Node Clustering
PaperID: 1161
 
Authors: Maria Barouti, Daniel Keren, Jacob Kogan, and Yaakov Malinovsky
 
Keywords: Applications of Clustering, Mining text documents, Text Mining
Reviewer 1's opinion
 
Ranking Criteria
Name 	Rank
Appropriateness to the Conference: 	Weak Accept
Originality: 	Marginal
Technical Strength: 	Weak Accept
Presentation: 	Weak Reject
Overall Evaluation: 	Marginal
 

Comments:
In this paper, the authors present a strategy to decrease the communication overhead of monitoring data streams in distributed systems.
In particular, they propose to apply a clustering algorithm to the nodes in order to introduce an intermediate control step before communicating
the violation of a threshold value to the central node.
Contrary to classical clustering algorithms, the proposed algorithm has an objective function
that aims to maximize the dissimilarity within each cluster.
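
The objective summarized above (grouping dissimilar vectors so that they offset one another within a cluster) can be illustrated with a minimal sketch. This is not the authors' algorithm: the function names, the `max_size` cap, and the merge rule (merge two clusters only when the norm of the merged mean is smaller than both individual mean norms) are assumptions made purely for illustration.

```python
import numpy as np


def mean_norm(vectors, members):
    """Norm of the average of the vectors held by the given node indices."""
    return float(np.linalg.norm(vectors[members].mean(axis=0)))


def greedy_cancelling_clusters(vectors, max_size=2):
    """Hypothetical greedy bottom-up grouping of nodes whose vectors cancel.

    Assumption: a cluster is "good" when the mean of its vectors has a
    small norm, i.e. its members are dissimilar enough to offset each other.
    """
    vectors = np.asarray(vectors, dtype=float)
    clusters = [[i] for i in range(len(vectors))]
    while True:
        best = None  # (merged_norm, index_a, index_b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                merged = clusters[a] + clusters[b]
                if len(merged) > max_size:
                    continue
                norm = mean_norm(vectors, merged)
                limit = min(mean_norm(vectors, clusters[a]),
                            mean_norm(vectors, clusters[b]))
                # Keep only merges that improve on both parts.
                if norm < limit and (best is None or norm < best[0]):
                    best = (norm, a, b)
        if best is None:
            break  # no merge reduces the mean norm any further
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Usage: nodes 0/1 and 2/3 hold opposite vectors, so each pair cancels exactly.
print(greedy_cancelling_clusters([[1, 0], [-1, 0], [0, 2], [0, -2]]))
# → [[0, 1], [2, 3]]
```

Under these assumptions, two similar vectors such as `[1, 0]` and `[2, 0]` are left in separate clusters, because averaging them does not reduce the norm below that of the shorter one.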

Although the main idea seems to be promising, the paper has some issues:
- All the sections are full of equations, some of which refer to results obtained in other works. On the other hand, there is a lack of intuitive explanation of
some of the presented concepts. For this reason, the whole paper is difficult to follow.

- Section 2. The application in Text Mining does not appear well connected to the rest of the paper. Moreover, Equation (4), which should represent a measure
of the information gain of a feature, is not clear. In particular, it is not clear how the feature is represented in the formula, since the sum runs over
the values x11, x12, x21 and x22, which are independent of the given feature.
-----------------------------------------------------------------------------------------
I do emphasize in a couple of places that x11, x12, x21 and x22 depend on the selected "feature."
-----------------------------------------------------------------------------------------
- Section 2. The meaning of the relevance label r is not clear.
----------------------------------------------------------------------------------------
I "shift" the blame to the 2007 paper by Sharfman, Schuster, and Keren.
----------------------------------------------------------------------------------------

- Section 3. The proposed node clustering approach appears motivated by the results reported in Table 1. However, the authors only state
"The results immediately suggest to cluster nodes to further reduce communication load", without giving any explanation of the (possible) intuitions  
coming from such results.
----------------------------------------------------------------------------------------
I attempted to add some intuition by considering the clustering of the "longest" and "shortest"
vectors.
----------------------------------------------------------------------------------------
- Section 4. In Equation (8), the sum should run over the values {1,...,k} instead of
over the values {1,...,n}.
----------------------------------------------------------------------------------------
I changed n to k.
----------------------------------------------------------------------------------------
Moreover, the reported algorithm to
compute the history vectors hn(tj) is not clear. What do the authors mean by "for t increasing
from tj to tj+1"? (It seems to be a single iteration.)
----------------------------------------------------------------------------------------
I inserted a sentence to recall that t_j stands for the time at which the mean of the entire dataset is updated, and that
t = t_j + 1, t_j + 2, ..., t_{j+1}.
----------------------------------------------------------------------------------------
Moreover, in the same algorithm, a factor of 1/2 appears without any comment about its effect.
Why do the authors set this factor to 1/2 and not to another value?
----------------------------------------------------------------------------------------
I added a short remark that 1/2 is selected arbitrarily, and that a more sophisticated approach
may lead to variable weights derived from the history of the process.
----------------------------------------------------------------------------------------
- Section 4. The proposed clustering algorithm seems to work in a bottom-up agglomerative fashion. However, the large number of formulas makes such a simple approach
appear much more complicated than it is. The description should be highly simplified.
----------------------------------------------------------------------------------------
The current description contains two steps and takes about 1/4 of a page.
I do not see how it can be "highly simplified." I, therefore, do not touch it.
----------------------------------------------------------------------------------------
- Section 5. The authors apply their clustering approach in a scenario with only 10 nodes. 
I think that the advantages of a clustering approach should be evaluated in a scenario with 
a much higher number of nodes. 
----------------------------------------------------------------------------------------
I take this as a suggestion for future work and/or a BSF proposal (exactly as with the 1/2
weights statement).
----------------------------------------------------------------------------------------
Moreover, the authors do not report the number of identified clusters in the performed experiments.
----------------------------------------------------------------------------------------
The aim was to show that clustering may reduce communication, and the paper does this.
I did not think that somebody might also be interested in the number of clusters. This
information is available, and in the future we can include it. I suggest avoiding work
we consider nonessential right now: the submission deadline is looming, and there are
many things that should be done at this time (I think that the "right" choice of \alpha
is a much more interesting and important issue).
----------------------------------------------------------------------------------------
Reviewer 2's opinion
 
Ranking Criteria
Name 	Rank
Appropriateness to the Conference: 	Strong Accept
Originality: 	Weak Accept
Technical Strength: 	Weak Accept
Presentation: 	Strong Accept
Overall Evaluation: 	Weak Accept
 

Comments:
The paper presents an alternative approach to monitoring data streams in a distributed system. This approach combines system theory techniques and clustering.
The difference with respect to other clustering algorithms, which look for similar data, is that monitoring requires clusters of dissimilar vectors that cancel each other as much as possible.
I am not an expert on this topic, but the paper seems correct, it is clearly written, and the explanations of the methods appear to be clear and complete enough.
My doubts are due to the fact that the number of experiments and the comparison with the state of the art seem quite limited.
----------------------------------------------------------------------------------------
The point about "the comparison with state of the art" is valid, but I suggest postponing
this issue to the proposal preparation.
----------------------------------------------------------------------------------------
Reviewer 3's opinion
 
Ranking Criteria
Name 	Rank
Appropriateness to the Conference: 	Weak Accept
Originality: 	Weak Accept
Technical Strength: 	Weak Accept
Presentation: 	Weak Accept
Overall Evaluation: 	Weak Accept
 

Comments:
The paper presents a new technique based on systems theory and clustering for distributed data stream monitoring. The paper is novel and technically correct. The simulation results could be strengthened by a better presentation of the main benefits of the new approach. A better validation of the new method, comparing it with a larger number of competing approaches, would improve the paper significantly.
----------------------------------------------------------------------------------------
See my comment above.
----------------------------------------------------------------------------------------  