P2P-Zone  

Old 11-05-05, 07:37 AM   #1
ormm
Registered User
 
Join Date: May 2005
Posts: 2
Non-hierarchical distribution of a web search engine index

Open letter to the developers and supporters of free P2P technologies.

I am developing a search engine that spiders the web for Ogg and MP3 files; at the time of writing it has indexed roughly 70,000 files. (http://openmusic.op.funpic.org )

This is a brief discussion of ways to distribute this index non-hierarchically, built on top of existing P2P networks. It is submitted here as a request for opinions on how it could be implemented without violating or disturbing the protocol specifications and usability of those networks.

The motivation for implementing this is that many unsigned artists release their music on the internet, but those files are rarely available on P2P networks. The design also stands in contrast to central servers, since it would take too much computing power for any one server to provide such a service to the whole world's P2P networks.

The spidering and updating of the index is done centrally by our servers, with an engine released under the GNU GPL on SourceForge. It could be done non-hierarchically as well, but we insist that P2P clients must remain free of such add-ons and only implement things that directly benefit the user.

A conceptual approach to the problem:
The index is split into many small files (e.g. 100) of roughly 300 KB each, ordered by artist name. A client searching for an artist downloads the meta-file containing a URL to it, plus additional redundant information about other artists. This makes it impossible to search for a particular song without knowing the artist, but I consider that acceptable. The client then shares that meta-file, making it available to more users. The wanted target file is downloaded via HTTP.

Inconvenience
This might cause friction with clients that were not designed with this use in mind, since the files containing the index will be seen by people browsing the host who expect to find "real" files rather than a set of metadata.

The system would be vulnerable to erroneous data and spam, but that is more a question about the P2P concept itself than about this particular way of using it.

Is this worth implementing or not?

Sincerely, Johan Mattsson

You can find a part of the index here: http://openmusic.op.funpic.org/catalog/
Old 12-05-05, 07:18 PM   #2
Mazer
Earthbound misfit
 
Join Date: May 2001
Location: Moses Lake, Washington
Posts: 2,563

Interesting ideas, but I have more questions than suggestions, I'm afraid.

Linking P2P networks to content available on the WWW makes sense. Which P2P networks would you be using? Are you trying to make the metadata in this index searchable by those networks? The problem you'll run into there is that most networks only search file names and ID3 tags. If a peer hosted a chunk of the index, the network would only look at that chunk's file name and not the data it contains.
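One way around that limitation might be to encode each chunk's artist range directly in its file name, so an ordinary file-name search can still route a query to the right chunk. A rough Python sketch; the `openmusic-index-...` naming scheme here is invented for illustration, not anything the engine actually uses:

```python
import re

def _slug(s):
    """Reduce an artist name to lowercase alphanumerics for file names."""
    return re.sub(r"[^a-z0-9]+", "", s.lower())

def shard_filename(first_artist, last_artist, shard_no):
    """Encode a shard's artist range in its file name (hypothetical scheme)."""
    return f"openmusic-index-{shard_no:03d}-{_slug(first_artist)}-{_slug(last_artist)}.dat"

def matching_shard(artist, filenames):
    """Return the file whose encoded range contains `artist`, else None."""
    key = _slug(artist)
    for name in sorted(filenames):
        m = re.match(r"openmusic-index-\d+-([a-z0-9]*)-([a-z0-9]*)\.dat", name)
        if m and m.group(1) <= key <= m.group(2):
            return name
    return None
```

That way the network's plain file-name matching is enough to locate a chunk, without any node ever indexing the chunk's contents.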

I guess I'm not sure why the index needs to be decentralized. If your server already indexes ~70,000 links, couldn't it do more? If someone wanted to search for music on the web they'd probably prefer to use your website rather than a P2P network. Usually when it comes to web hosting, the factor that limits scalability isn't computing power, since you could always add more servers to do the job. That of course costs more in hosting fees, which is what usually determines the maximum capacity of a given web site.

Never mind, I think I just answered my own question. Decentralizing the index would keep the operating costs down to a minimum.

I guess to do this you would need two things: a server to compile the index and feed it into the P2P network, and a specialized client that is able to host, search, and redistribute the index to other nodes. Such a client would probably act a lot like Freenet. The network protocol doesn't need to be specialized as far as I can see; Gnutella would probably do the job. The only catch is that everyone would need to be running this client in order for it to work.
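To make that concrete, here is a sketch of one small piece such a client would need: a parser for a possible meta-file format. The tab-separated layout and the `neighbour:` lines (carrying the redundant pointers to adjacent chunks that ormm mentions) are assumptions for illustration, not an existing format:

```python
def parse_metafile(text):
    """Parse a hypothetical meta-file.

    Format (assumed): one 'Artist<TAB>URL' line per indexed file, plus
    'neighbour: <url>' lines pointing at adjacent index chunks, and
    '#' comment lines.
    """
    tracks, neighbours = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("neighbour:"):
            neighbours.append(line.split(":", 1)[1].strip())
        else:
            artist, url = line.split("\t", 1)
            tracks.setdefault(artist, []).append(url)
    return tracks, neighbours
```

After parsing, the client would fetch the wanted target file from its URL over plain HTTP, exactly as in ormm's design.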

P.S. Welcome to NU, ormm.
Old 13-05-05, 04:38 AM   #3
ormm
Registered User
 
Join Date: May 2005
Posts: 2

"If a peer hosted a chunk of the index the network would only look at that chunk's file name and not the data it contains."

But you can determine which meta-file you need, since the index is ordered by artist name.


"Nevermind, I think I just answered my own question. Decentralizing the index would keep the operating costs down to a minimum."

Yes, and this will help ensure it stays "free" and is managed by the open source community.

I have done some experiments with Gnutella and it works fine. It is also possible to do it with XNap, though in that case it depends on the plugins installed. It would not work as nicely with OpenNap, however, since its central server design depends on host browsing.


Powered by vBulletin® Version 3.6.4
Copyright ©2000 - 2024, Jelsoft Enterprises Ltd.
© www.p2p-zone.com - Napsterites - 2000 - 2024 (Contact grm1@iinet.net.au for all admin enquiries)