http://aws.amazon.com/datasets
DBPedia might be the easiest to work with:
http://aws.amazon.com/datasets/2319

Amazon has a lot of these things. Infochimps.com is a marketplace for
free & pay versions.

Lance

On Thu, Sep 15, 2011 at 6:55 PM, Pulkit Singhal <pulkitsing...@gmail.com> wrote:
> Ah, missing }. Doh!
>
> BTW, I still welcome any ideas on how to build an e-commerce test base.
> It doesn't have to be Amazon; that was just my approach. Anyone?
>
> - Pulkit
>
> On Thu, Sep 15, 2011 at 8:52 PM, Pulkit Singhal <pulkitsing...@gmail.com> wrote:
> > Thanks for all the feedback thus far. Now to get a little technical
> > about it :)
> >
> > I was thinking of building a file of all the Amazon tags that yield
> > roughly 50,000 results each and then running my RSS DIH off of that.
> > I came up with the following config, but something is amiss. Can
> > someone please point out what is off about it?
> >
> > <document>
> >   <entity name="amazonFeeds"
> >           processor="LineEntityProcessor"
> >           url="file:///xxx/yyy/zzz/amazonfeeds.txt"
> >           rootEntity="false"
> >           dataSource="myURIreader1"
> >           transformer="RegexTransformer,DateFormatTransformer">
> >     <entity name="feed"
> >             pk="link"
> >             url="${amazonFeeds.rawLine"
> >             processor="XPathEntityProcessor"
> >             forEach="/rss/channel | /rss/channel/item"
> >             transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow">
> >       ...
> >
> > The rawLine should feed into the url attribute, but instead I get:
> >
> > Caused by: java.net.MalformedURLException: no protocol: null${amazonFeeds.rawLine
> >     at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:90)
> >
> > Sep 15, 2011 8:48:01 PM org.apache.solr.update.DirectUpdateHandler2 rollback
> > INFO: start rollback
> >
> > Sep 15, 2011 8:48:01 PM org.apache.solr.handler.dataimport.SolrWriter rollback
> > SEVERE: Exception while solr rollback.
> >
> > Thanks in advance!
> >
> > On Thu, Sep 15, 2011 at 4:12 PM, Markus Jelsma
> > <markus.jel...@openindex.io> wrote:
> >> If we want to test with huge amounts of data, we feed portions of the
> >> internet. The problem is that it takes a lot of bandwidth and a lot of
> >> computing power to get to a `reasonable` size. On the positive side,
> >> you deal with real text, so it's easier to tune for relevance.
> >>
> >> I think it's easier to create a simple XML generator with mock data:
> >> prices, popularity rates, etc. It's fast to generate millions of mock
> >> products, and once you have a large quantity of XML files, you can
> >> easily index, test, change the config or schema, and reindex.
> >>
> >> On the other hand, the sample data that comes with the Solr example
> >> is a good set as well, as it proves the concepts nicely, especially
> >> with the stock Velocity templates.
> >>
> >> We know Solr will handle enormous sets, but quantity is not always
> >> part of a PoC.
> >>
> >>> Hello everyone,
> >>>
> >>> I have a goal of populating Solr with a million unique products in
> >>> order to create a test environment for a proof of concept. I started
> >>> out by using DIH with Amazon RSS feeds, but I quickly realized that
> >>> there's no way I can glean a million products from one RSS feed. And
> >>> I'd go mad if I just sat at my computer all day looking for feeds
> >>> and punching them into the DIH config for Solr.
> >>>
> >>> Has anyone ever had to create large mock/dummy datasets for test
> >>> environments or for POCs/demos to convince folks that Solr was the
> >>> wave of the future?
> >>> Any tips would be greatly appreciated. I suppose it sounds a lot
> >>> like crawling, even though it started out as innocent DIH usage.
> >>>
> >>> - Pulkit

--
Lance Norskog
goks...@gmail.com
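
For anyone who hits the same MalformedURLException: the "missing }"
Pulkit found is the unterminated placeholder in the inner entity's url
attribute. A corrected sketch of just that entity follows, with every
other attribute exactly as in the config above; LineEntityProcessor
exposes each line of the input file as ${amazonFeeds.rawLine}:

<entity name="feed"
        pk="link"
        url="${amazonFeeds.rawLine}"
        processor="XPathEntityProcessor"
        forEach="/rss/channel | /rss/channel/item"
        transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow">
  ...
</entity>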
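
To make Markus's generator suggestion concrete, here is a minimal
sketch in Python that writes Solr <add><doc> XML. Everything in it
(the field names, value ranges, and output file name) is an
illustrative assumption, not something from this thread; rename the
fields to whatever your schema.xml actually defines.

#!/usr/bin/env python
# Minimal mock-product generator: writes a Solr XML update file
# that can be posted straight to the /update handler.
import random
from xml.sax.saxutils import escape

ADJECTIVES = ["red", "blue", "compact", "deluxe", "wireless"]
NOUNS = ["widget", "gadget", "camera", "toaster", "headset"]

def mock_product(i):
    # Field names (id, name, price, popularity) are assumptions;
    # adjust them to match your schema.
    name = "%s %s %d" % (random.choice(ADJECTIVES), random.choice(NOUNS), i)
    price = random.uniform(1.0, 500.0)
    popularity = random.randint(1, 10)
    return ('<doc>'
            '<field name="id">PROD-%07d</field>'
            '<field name="name">%s</field>'
            '<field name="price">%.2f</field>'
            '<field name="popularity">%d</field>'
            '</doc>' % (i, escape(name), price, popularity))

out = open("mock_products.xml", "w")
out.write("<add>\n")
for i in range(1, 1000001):  # one million unique products
    out.write(mock_product(i) + "\n")
out.write("</add>\n")
out.close()

Indexing it is then a one-liner with the post tool that ships in the
Solr example's exampledocs directory, e.g.
java -jar post.jar mock_products.xml. Splitting the output into
several smaller <add> files keeps each HTTP POST to a manageable size.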