You are viewing the community [info]b659project

B659 Web Mining Project's Journal
 
[Most Recent Entries] [Calendar View] [Friends]

Below are the 13 most recent journal entries recorded in B659 Web Mining Project's LiveJournal:

    Friday, April 21st, 2006
    4:07 pm
    [dyrecorn]
    Project Discussion
    [info]elwoodthegreat and I met today to discuss the current state of the project and future directions.

    Points we discussed:

    For LJ parsing, each entry has its own URL - we can parse these out. The way I divided the pages, crawling the LJs are more like doing a search on how consistently on-topic the entire page is, rather than a history-based approach. Still, this is interesting.

    I suggested that in addition to a re-ranking, we also present a link off of each entry in the list that links to the WBM's copy of the version of the page that provided the highest acceleration. In this way, we're providing something closer to Google's original ranking - a list of the most highly-rated pages.

    The realization that prompted this was our awareness that the current project has drifted a little in scope. What we're presenting is not so much a competitor to Google's ranking. The reason for this is that the acceleration model can rank a page highly if it's been very on-topic in the past, even if the current version isn't. THis means that a user visiting the current version looking for immediate payoff may be confused by our ranking. What we're actually finding is something more akin to "which pages should I come back to / bookmark for consistent information on topic X?" The pages we rank highly have a high probability of presenting new information on a given topic over time.

    Thus, for our project, a user study would have to be done over a period of time, such as 6 months, comparing Google's current top rankings for a query to our top ranking over the same time period.

    This is one of the reasons I then proposed we keep the most highly-ranked copies, because that allows us to present something mroe akin to what Google is - an instant payoff of the desired information.
    1:56 pm
    [dyrecorn]
    Test Sets
    So, in keeping with one of our searches as "Google Pagerank", I created two communities and segmented two large articles on Google and Pagerank into multiple entries over them.

    They can be found here:

    http://community.livejournal.com/b659projecttest/
    (From Wikipedia's article)

    and here:

    http://community.livejournal.com/b659test2/
    (From a random site's article)

    We should be able to apply our algorithm to this for a more "controlled" evaluation, assuming time permits.
    Wednesday, April 19th, 2006
    12:38 pm
    [elwoodthegreat]
    J and I met about a week ago to discuss the acceleration model in place and determine what is left to be done on the project. We both thought that the model looked good theoretically and ran a few test queries:

    News America
    Bush Iraq
    Google PageRank


    I finally got around to generating the graphs for these three queries. They are rather large so I have only included links to them below:

    http://www.cs.indiana.edu/~cjcolvar/b659/newsamericaplot.png
    http://www.cs.indiana.edu/~cjcolvar/b659/bushiraqplot.png
    http://www.cs.indiana.edu/~cjcolvar/b659/googlepagerankplot.png


    One oddity that struck my eye was the unusually high spike in www.theonion.com on the News America graph. The page in question is: http://web.archive.org/web/20040305195417/http://www.theonion.com/

    The results outputted by our program reads:
    http://www.theonion.com/ 1.0 2.0 1081212857000 70.68654

    Somehow a match is being found even though the page only has 2 shingles. The result is www.theonion.com is obtaining a ridiculously high velocity (70.68654). This should probably be fixed to avoid the after effects of a velocity that high.
    Tuesday, April 4th, 2006
    7:56 pm
    [elwoodthegreat]
    I felt like rewriting the program to make it more modular for the future addition of crawlers for different archives than just the Wayback Machine. While I was at it, I decided to take a shot at implementing the acceleration model.

    Here is the body of the accelerate() method:
    double dragConstant = 0.2;
    public void accelerate(double value, double mass, double time){
    if(mass==0){
    return;
    }else{
    velocity += (((value - (dragConstant * velocity))/mass) * (time/100000000));
    }

    if(velocity < 0){
    velocity = 0;
    }
    }

    Here value (or Force) is the number of matches found on an archived page scaled by its dissimilarity to the previous page, mass is the number of shingles the page has, and time is the amount of time between this archived page and the previous page (in milliseconds).

    I experimented briefly with different dragConstants on the comparison query Google OR FUD between slashdot.org and boingboing.net

    The following graphs show the outcome at dragConstant = 0.01, 0.1, and 0.2.







    I think 0.2 worked fairly well in this case and might even think about increasing it, but I haven't tested it on queries that produce mediocre or results worse than these two pages.

    More graphs and results to come soon. Please feel free to give us any queries you want tested.
    Monday, March 27th, 2006
    1:44 pm
    [elwoodthegreat]
    Lots of updates to the program:

    From just before break:

    I fixed that problem we were having with StringNode's not being what I thought they were. I did it by changing the website into a all lowercase string of viewable text (StringBean) and then counting occurrences of keywords or quoted strings by using regular expressions. How I handled the boolean expressions is as follows:

    keyword1 AND keyword2 = min(count(keyword1),count(keyword2))
    keyword1 OR keyword2 = count(keyword1) + count(keyword2)
    keyword1 NOT keyword2 = if(count(keyword2)>0) return 0 else count(keyword2)

    Nesting of boolean expressions is done the same as above.
    This seemed like the best definitions for these expressions in my opinion.

    With this and the shingling, I performed the query:
    Bloomington Indiana public (Transportation OR Transit)
    and came up with decent results except that some pages were static and had many references versus www.bloomingtontransit.com which only had a few per archived page. So this looks really good considering that this should all be resolved when the acceleration/velocity model gets added.

    If we want to do direct comparisons between Google's ranking and this program, we will have to transform the query syntax to be more Google-like at runtime before submitting it to the Google.


    From today:

    Fixed small but significant bug with matching system implemented above. This partly fixed the problem mentioned above, but the acceleration model should hopefully rectify it completely.

    Added a comparison mode so the user can enter the urls to be queried.
    For example:
    The query "FUD" on http://slashdot.org and http://boingboing.net resulted in the following ranking:
    #1: http://boingboing.net with a score of 0.09062081019654906
    #2: http://slashdot.com with a score of 0.035449470969946446

    I also reduced the memory overhead a little so processing the above example is possible.

    Other things to work on include:
    -Implement the acceleration model
    -Making the number of Google results to rank user specified
    -Converting the query to Google syntax before executing it
    -Adding a nifty interface
    Monday, March 6th, 2006
    10:51 am
    [elwoodthegreat]
    Updated Prototype
    Last week, I updated the prototype:
    1) Refactored the code so it will be much easier to extend and modify
    2) Added sampling normalization so only the sparsest sampling rate will be used.
    In doing this I measured the sample density as:
    # of entries / (Last entry - First entry)
    I thought this might be better than:
    # of entries / (Current time - First entry)
    due to the fact that a page might have stopped existing in the archive. Although the chances that we have to deal with a page like this is dependent on the recency of Google's index.

    Next things to do are integrate the acceleration model and scaling based upon page change.
    Wednesday, February 22nd, 2006
    5:34 pm
    [dyrecorn]
    Future work
    Fil notes we could also use this to do page analysis to reveal the most important concepts in a given site.

    I note we could then use vectors of these terms to do community building in a group of sites.

    TF*IDF*TP (term preservation)?

    Will use term frequency for the normal project.
    5:22 pm
    [dyrecorn]
    Spoke with Fil today:

    He points out the the sampling bias causes us to have normalization problems, which we should address before we look into adding other metrics.

    He suggests that we only take, say 10 images for each page. But what if a page has existed for 10 years vs one that has only existed for 1 year. He suggests possibly only chosing 1 update per month. Normalize based on a time unit, say 1 per 3 months or 1 per 3 months. Minimum sampling (with a max) based on the sparsest one.

    We can probably use shingling for duplicate detection.

    Ideas to score a page higher based on the amount of change as well as the frequency of term occurence.
    Friday, February 17th, 2006
    5:05 pm
    [dyrecorn]
    Meeting Update
    [info]elwoodthegreat and I met today to discuss the project.

    After a code review, we came to some conclusions.

    I need to review and add onto the scoring code and the "acceleration" code that updates a page's score.

    We decided that we might encounter a problem if a given page has many updates stored in the WBM that are generally cosmetic - i.e. last modified changes and such. This would give it a high score in our algorithm because those terms would remain in the page over a larger period of time. We decided that we should talk to Fil about ways to store and detect duplicates, and only score a page if it deviated by a certain amount from previous versions. (We might only have to store the previous version for this) Also, should we score a page based on the amount by which it deviates from a previous version?

    Some of the math in the page scoring seems possibly off since we're getting a lot of 0s, and it seems like we shouldn't.

    We decided that hitting the WBM with 10 threads at a time is not that bad, so we're leaving it as is so that it runs faster. If we ever need to run multiple copies we might change it to sequential so we're not bombarding the WBM.
    Monday, February 13th, 2006
    10:04 am
    [elwoodthegreat]
    Clustering Idea
    While reading a part of Chapter 4, I was considering the following approach:

    Find the centroid of the TFIDF space of all pages in the archive for a given result page
    Rank based upon distance or cosine similarity between query and result centroids

    This would not favor those that "changed and mentioned the query" or account for recency at all, but it would be another interesting measure that seems somewhat intuitive.
    9:47 am
    [elwoodthegreat]
    Prototype and Initial Results
    I finished coding a prototype about a week ago and later added AND,OR,NOT, and () to the pre-existing keyword queries. Right now the velocity model is not in place, but should be easy to add.

    At the moment, this is how it works:
    1) Run the project and pass a query as a parameter (make sure to include \" for quotes since " will just mark the start and end of a command-line parameter)
    2) Google API grabs the top 10 results for that query
    3) A filter is setup based on the query
    4) A thread is spawned for each Google result
    5) Each thread queries the Wayback Machine - if it is not in the archive or it excluded itself and error message is returned
    6) Each thread crawls every archived copy of the page returned by the Wayback Machine applying the filter and counting the number of string nodes left
    7) The total number of query hits for a given page is averaged over the number of archived pages
    8) Each thread returns separately and prints out its results before doing so

    The main framework should be done with minor tweaking needed as we integrate different similarity models. One issue to consider is what to do when Google returns a single archived page from an updating source like a single blog entry from a blog. This should result in a static page with respect to the archived copies in the Wayback Machine. Maybe in addition to favoring pages that continuously mention the query, we should also favor those that change and mention the query.


    Here are some initial results based on this setup:

    Query: ("Chess Club") AND Kasparov

    Google API Results:
    #1: http://www.hullchessclub.com/
    #2: http://www.download-free-games.com/chess_game_download/kasparov_chessmate.htm
    #3: http://www.gtryfon.demon.co.uk/bcc/drama/kaspvsrestofworld/kaspvsrest.htm
    #4: http://www.gtryfon.demon.co.uk/bcc/drama/kaspvsrestofworld/kasparovquiz/barnet_chess_club_kasparov_quiz.htm
    #5: http://www.deafgamers.com/kasparov_chessmate.htm
    #6: http://www-ims.oit.umass.edu/chess/chess1.shtml
    #7: http://www.burlingamechessclub.com/
    #8: http://chess.about.com/cs/playerspeople/v/vid03lgk.htm
    #9: http://www.chessbase.com/newsdetail.asp?newsid=1943
    #10: http://www.chessbase.com/newsdetail.asp?newsid=1365

    Prototype Results:
    #1: http://web.archive.org/*/http://www.chessbase.com/newsdetail.asp?newsid=1943: 3.0
    #2: http://web.archive.org/*/http://www.deafgamers.com/kasparov_chessmate.htm: 2.0
    #3: http://web.archive.org/*/http://www.download-free-games.com/chess_game_download/kasparov_chessmate.htm: 1.0
    #4: http://web.archive.org/*/http://www.hullchessclub.com/: 0.9302325581395349
    #5: http://web.archive.org/*/http://www.gtryfon.demon.co.uk/bcc/drama/kaspvsrestofworld/kaspvsrest.htm: 0.9
    #6: http://web.archive.org/*/http://www.gtryfon.demon.co.uk/bcc/drama/kaspvsrestofworld/kasparovquiz/barnet_chess_club_kasparov_quiz.htm: 0.8888888888888888
    #7: http://web.archive.org/*/http://www-ims.oit.umass.edu/chess/chess1.shtml: 0.875
    #8: http://web.archive.org/*/http://www.chessbase.com/newsdetail.asp?newsid=1365: 0.75
    #9: http://web.archive.org/*/http://www.burlingamechessclub.com/: 0.0
    #10: Error or no matches on http://web.archive.org/*/http://chess.about.com/cs/playerspeople/v/vid03lgk.htm


    Query: kasparov OR "State of the Union"

    Google API Results:
    #1: http://www.chesscorner.com/worldchamps/kasparov/kasparov.htm
    #2: http://en.wikipedia.org/wiki/Garry_Kasparov
    #3: http://www.research.ibm.com/deepblue/
    #4: http://www.whitehouse.gov/news/releases/2003/01/20030128-19.html
    #5: http://www.whitehouse.gov/news/releases/2002/01/20020129-11.html
    #6: http://classic.zone.msn.com/kasparov/
    #7: http://www.sonypictures.com/movies/triplex/
    #8: http://www.chesschamps.com/
    #9: http://www.imdb.com/title/tt0329774/
    #10: http://www.kasparovchess.com/

    Prototype Results
    #1: http://web.archive.org/*/http://en.wikipedia.org/wiki/Garry_Kasparov: 40.0
    #2: http://web.archive.org/*/http://www.whitehouse.gov/news/releases/2002/01/20020129-11.html: 5.1568627450980395
    #3: http://web.archive.org/*/http://www.chesscorner.com/worldchamps/kasparov/kasparov.htm: 4.354838709677419
    #4: http://web.archive.org/*/http://www.imdb.com/title/tt0329774/: 3.810810810810811
    #5: http://web.archive.org/*/http://www.whitehouse.gov/news/releases/2003/01/20030128-19.html: 2.85
    #6: http://web.archive.org/*/http://www.chesschamps.com/: 1.0
    #7: http://web.archive.org/*/http://www.research.ibm.com/deepblue/: 0.9661016949152542
    #8: http://web.archive.org/*/http://classic.zone.msn.com/kasparov/: 0.9285714285714286
    #9: http://web.archive.org/*/http://www.sonypictures.com/movies/triplex/: 0.0
    #10: Error or no matches on http://web.archive.org/*/http://www.kasparovchess.com/
    Tuesday, January 31st, 2006
    10:44 am
    [dyrecorn]
    B659 - Project Proposal
    While many way of ranking results from a user query have been explored, most of these focus on some heuristic that only considers the most recent version of the page as seen by a crawler. Due to the impossible task of visiting every page in the Web even once, less effort has been made on doing any temporal analysis of web pages, as this requires a history of the page that must be built by re-visiting the page over a span of time. For small subsets of the Web this is feasible, but for the scale required by a major crawler, less so. In our project we propose to exploit an already existing archival resource to enable us to attempt some temporal measures.
    The resource we will be using is the Wayback Machine Internet Archive. Founded in 1996 to act as an Internet Library, the Wayback Machine archives multiple versions of many popular pages over a span of years. Interfaces are available for querying this resource, and we expect we will be able to make use of these.
    Our project will start with a user submitted query string. This query string will be sent to Google using the Google API, and we will receive the top 10 results from Google. We will then, for each resultant URL, use the Wayback Machine to apply a heuristic to measuring the URL’s value over time. Finally, we will re-rank the top 10 by this new score and compare the differences between our ranking and Google’s.
    Time permitting, we will measure several different heuristics, however, the general model we propose to use is an inertial model. In general, this works by treating a URL’s score as a velocity starting at 0 and subject to decay, and positive measures of a particular version of that URL at a particular time are treated as acceleration. So, for example, if we have 10 versions of a URL over time, the URL’s score will start at 0. If the first 3 pages are good, the URL will experience rapid “acceleration” and achieve a high “velocity”. However, if the next 7 versions are bad, no further acceleration will be applied and so the velocity will decay down over time. The final velocity will be the URL’s score.
    The way in which we actually measure the quality of a given version of a given page will also be another place where we might try several different measures. Most likely, we will initially try a naïve search in which we simply do keyword matching based on frequency and possibly proximity. Assuming we can complete this on schedule, we can move to more complicated measures, although it is unlikely that we will be able to use measures such as PageRank due to the large demand this would place on the Wayback Machine’s servers.
    Our results for this project, while unlikely to include a user study, will still illustrate whether our general concept provides a different and potentially useful way of looking at webpage ranking. If this turns out to be the case, expanding the work may provide an opportunity for future research.
    Monday, January 30th, 2006
    3:32 pm
    [dyrecorn]
    Welcome
    This journal is for a class project, as detailed here:

    http://informatics.indiana.edu/fil/Class/b659/
About LiveJournal.com