http://aws.amazon.com/datasets
DBPedia might be the easiest to work with:
http://aws.amazon.com/datasets/2319

Amazon has a lot of these things. Infochimps.com is a marketplace for
free & pay versions.

Lance

On Thu, Sep 15, 2011 at 6:55 PM, Pulkit Singhal <pulkitsing...@gmail.com> wrote:
> Ah, missing }. Doh!
>
> BTW, I still welcome any ideas on how to build an e-commerce test base.
> It doesn't have to be Amazon; that was just my approach. Anyone?
>
> - Pulkit
>
> On Thu, Sep 15, 2011 at 8:52 PM, Pulkit Singhal <pulkitsing...@gmail.com> wrote:
> > Thanks for all the feedback thus far. Now to get a little technical
> > about it :)
> >
> > I was thinking of building a file of all the Amazon tags that yield
> > roughly 50,000 results each and then running my RSS DIH off of that.
> > I came up with the following config, but something is amiss. Can
> > someone please point out what is off about it?
> >
> > <document>
> >   <entity name="amazonFeeds"
> >           processor="LineEntityProcessor"
> >           url="file:///xxx/yyy/zzz/amazonfeeds.txt"
> >           rootEntity="false"
> >           dataSource="myURIreader1"
> >           transformer="RegexTransformer,DateFormatTransformer">
> >     <entity name="feed"
> >             pk="link"
> >             url="${amazonFeeds.rawLine"
> >             processor="XPathEntityProcessor"
> >             forEach="/rss/channel | /rss/channel/item"
> >             transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow">
> >       ...
> >
> > The rawLine should feed into the url attribute, but instead I get:
> >
> > Caused by: java.net.MalformedURLException: no protocol: null${amazonFeeds.rawLine
> >     at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:90)
> >
> > Sep 15, 2011 8:48:01 PM org.apache.solr.update.DirectUpdateHandler2 rollback
> > INFO: start rollback
> >
> > Sep 15, 2011 8:48:01 PM org.apache.solr.handler.dataimport.SolrWriter rollback
> > SEVERE: Exception while solr rollback.
> >
> > Thanks in advance!
> >
> > On Thu, Sep 15, 2011 at 4:12 PM, Markus Jelsma
> > <markus.jel...@openindex.io> wrote:
> >> If we want to test with huge amounts of data, we feed portions of the
> >> internet. The problem is that it takes a lot of bandwidth and a lot of
> >> computing power to get to a `reasonable` size. On the positive side,
> >> you deal with real text, so it's easier to tune for relevance.
> >>
> >> I think it's easier to create a simple XML generator with mock data:
> >> prices, popularity rates, etc. It's fast to generate millions of mock
> >> products, and once you have a large quantity of XML files, you can
> >> easily index, test, change the config or schema, and reindex.
> >>
> >> On the other hand, the sample data that comes with the Solr example
> >> is a good set as well, as it proves the concepts nicely, especially
> >> with the stock Velocity templates.
> >>
> >> We know Solr will handle enormous sets, but quantity is not always
> >> part of a PoC.
> >>
> >>> Hello everyone,
> >>>
> >>> I have a goal of populating Solr with a million unique products in
> >>> order to create a test environment for a proof of concept. I started
> >>> out by using DIH with Amazon RSS feeds, but I quickly realized that
> >>> there's no way I can glean a million products from one RSS feed. And
> >>> I'd go mad if I just sat at my computer all day looking for feeds
> >>> and punching them into the DIH config for Solr.
> >>>
> >>> Has anyone ever had to create large mock/dummy datasets for test
> >>> environments or for POCs/demos to convince folks that Solr was the
> >>> wave of the future?
> >>> Any tips would be greatly appreciated. I suppose it sounds a lot
> >>> like crawling, even though it started out as innocent DIH usage.
> >>>
> >>> - Pulkit

--
Lance Norskog
goks...@gmail.com
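
For anyone who hits the same MalformedURLException: the "missing }"
Pulkit found is the unterminated placeholder in the inner entity's url
attribute. A corrected sketch of just that entity follows, with every
other attribute exactly as in the config above; LineEntityProcessor
exposes each line of the input file as ${amazonFeeds.rawLine}:

<entity name="feed"
        pk="link"
        url="${amazonFeeds.rawLine}"
        processor="XPathEntityProcessor"
        forEach="/rss/channel | /rss/channel/item"
        transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow">
  ...
</entity>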
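
To make Markus's generator suggestion concrete, here is a minimal
sketch in Python that writes Solr <add><doc> XML. Everything in it
(the field names, value ranges, and output file name) is an
illustrative assumption, not something from this thread; rename the
fields to whatever your schema.xml actually defines.

#!/usr/bin/env python
# Minimal mock-product generator: writes a Solr XML update file
# that can be posted straight to the /update handler.
import random
from xml.sax.saxutils import escape

ADJECTIVES = ["red", "blue", "compact", "deluxe", "wireless"]
NOUNS = ["widget", "gadget", "camera", "toaster", "headset"]

def mock_product(i):
    # Field names (id, name, price, popularity) are assumptions;
    # adjust them to match your schema.
    name = "%s %s %d" % (random.choice(ADJECTIVES), random.choice(NOUNS), i)
    price = random.uniform(1.0, 500.0)
    popularity = random.randint(1, 10)
    return ('<doc>'
            '<field name="id">PROD-%07d</field>'
            '<field name="name">%s</field>'
            '<field name="price">%.2f</field>'
            '<field name="popularity">%d</field>'
            '</doc>' % (i, escape(name), price, popularity))

out = open("mock_products.xml", "w")
out.write("<add>\n")
for i in range(1, 1000001):  # one million unique products
    out.write(mock_product(i) + "\n")
out.write("</add>\n")
out.close()

Indexing it is then a one-liner with the post tool that ships in the
Solr example's exampledocs directory, e.g.
java -jar post.jar mock_products.xml. Splitting the output into
several smaller <add> files keeps each HTTP POST to a manageable size.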