I have a column in my database that is of type long text and holds XML content. When I define the entity, is there a way to provide a custom extractor that takes in the XML and returns rows with the appropriate fields to be indexed?
Thank you,
Peri Subrahmanya

On 4/26/13 12:24 PM, "Shawn Heisey" <s...@elyograg.org> wrote:

>On 4/25/2013 9:00 AM, xiaoqi wrote:
>> i using DIH to build index is slow , when it fetch 2 million rows , it
>> will spend 20 minutes , very slow.
>
>If it takes 20 minutes for two million records, I'd say it's working
>very well. I do six simultaneous MySQL imports of 13 million records
>each. It takes a little over 3 hours on Solr 3.5.0, a little over four
>hours on Solr 4.2.1 (due to compression and the transaction log). If I
>do them one at a time instead of all at once, it will go *slightly*
>faster for each one, but the overall process would take a whole day.
>For comparison purposes, that's about 20 minutes each time it does 1
>million rows. Yours is going twice as fast as mine.
>
>Looking at your config file, I don't see a batchSize parameter. This is
>a change that is specific to MySQL. You can greatly reduce the memory
>usage by including this attribute in the dataSource tag along with the
>user and password:
>
>batchSize="-1"
>
>With two million records and no batchSize parameter, I'm surprised you
>aren't hitting an Out Of Memory error. By default JDBC will pull down
>all the results and store them in memory, then DIH will begin indexing.
>A batchSize of -1 makes DIH tell the MySQL JDBC driver to stream the
>results instead of storing them. Reducing the memory usage in this way
>might make it go faster.
>
>Thanks,
>Shawn
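
For reference, the batchSize suggestion quoted above might look roughly like this in a DIH data-config.xml. This is only a sketch: the JDBC URL, database name, credentials, entity name, query, and field names are placeholders made up for illustration and do not come from this thread, and the driver class assumes the usual MySQL Connector/J.

```xml
<dataConfig>
  <!-- batchSize="-1" makes DIH ask the MySQL JDBC driver to stream rows
       instead of holding the whole result set in memory. -->
  <!-- url, user, and password below are placeholder values. -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/exampledb"
              user="example_user"
              password="example_password"
              batchSize="-1"/>
  <document>
    <!-- Hypothetical entity and query, shown only to make the file complete. -->
    <entity name="item" query="SELECT id, title FROM item">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
    </entity>
  </document>
</dataConfig>
```

The only change that matters for memory usage is the batchSize attribute on the dataSource tag; the rest of an existing config should be able to stay as it is.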