Ultimately I just temporarily increased the memory to handle this data set, but that won't always be practical.

I did try the csv export/import and it worked well in this case. I hadn't considered it at first. I am wary that the escaping and splitting may be problematic with some data sets, so I'll look into adding XMLResponseParser support to XPathEntityProcessor (essentially a useSolrResponseSchema option), though I have a feeling only a few other people would be interested in this.
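The rough shape of what I have in mind, as a sketch only (useSolrResponseSchema does not exist today, and the URL is made up):

    <dataConfig>
      <dataSource type="URLDataSource"/>
      <document>
        <!-- useSolrResponseSchema is the proposed, not-yet-existing option;
             it would read the solr response schema directly, so no forEach
             or field declarations would be needed -->
        <entity name="migrated"
                processor="XPathEntityProcessor"
                url="http://files.example.com/select"
                stream="true"
                useSolrResponseSchema="true"/>
      </document>
    </dataConfig>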
Thanks for the replies.

On Mon, Oct 14, 2013 at 11:19 PM, Lance Norskog <goks...@gmail.com> wrote:
> Can you do this data in CSV format? There is a CSV reader in the DIH.
> The SEP was not intended to read from files, since there are already
> better tools that do that.
>
> Lance
>
> On 10/14/2013 04:44 PM, Josh Lincoln wrote:
>> Shawn, I'm able to read in a 4mb file using SEP, so I think that rules
>> out the POST buffer being the issue. Thanks for suggesting I test
>> this. The full file is over a gig.
>>
>> Lance, I'm actually pointing SEP at a static file (I simply named the
>> file "select" and put it on a Web server). SEP thinks it's a large
>> solr response, which it was, though now it's just static xml. It works
>> well until I hit the memory limit of the new solr instance.
>>
>> I can't query the old solr from the new one because they're on two
>> different networks. I can't copy the index files because I only want a
>> subset of the data (identified with a query and dumped to xml... all
>> fields of interest were stored). To further complicate things, the old
>> solr is 1.4. I was hoping to use the result xml format to back up the
>> old index, and DIH SEP to import into the new dev Solr 4.x. It's
>> promising as a simple and repeatable migration process, except that
>> SEP fails on largish files.
>>
>> It seems my options are 1) use the xpathprocessor and identify each
>> field (there are many fields); 2) write a small script to act as a
>> proxy to the xml file, accept the rows and start parameters from the
>> SEP iterative calls, and return just a subset of the docs; 3) write a
>> script to process the xml and push it to solr, not using DIH; 4) use
>> XSLT to transform the result xml into an update message and use
>> XPathEntityProcessor with useSolrAddSchema=true and streaming. The
>> latter seems like the most elegant and reusable approach, though I'm
>> not certain it'll work.
>>
>> It'd be great if solrEntityProcessor could stream static files, or if
>> I could specify the solr result format while using the
>> xpathentityprocessor (i.e. a useSolrResultSchema option).
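>> For reference, a minimal data-config along the lines of what I'm using
>> (host and path made up). SEP pages through {url}/select with rows and
>> start parameters, which is why naming the static file "select" works:
>>
>>     <dataConfig>
>>       <document>
>>         <entity name="sep"
>>                 processor="SolrEntityProcessor"
>>                 url="http://files.example.com/oldsolr"
>>                 query="*:*"
>>                 rows="50"/>
>>       </document>
>>     </dataConfig>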
>> Any other ideas?
>>
>> On Mon, Oct 14, 2013 at 6:24 PM, Lance Norskog <goks...@gmail.com> wrote:
>>> On 10/13/2013 10:02 AM, Shawn Heisey wrote:
>>>> On 10/13/2013 10:16 AM, Josh Lincoln wrote:
>>>>> I have a large solr response in xml format and would like to import
>>>>> it into a new solr collection. I'm able to use DIH with
>>>>> solrEntityProcessor, but only if I first truncate the file to a
>>>>> small subset of the records. I was hoping to set stream="true" to
>>>>> handle the full file, but I still get an out of memory error, so I
>>>>> believe stream does not work with solrEntityProcessor (I know the
>>>>> docs only mention the stream option for the XPathEntityProcessor,
>>>>> but I was hoping solrEntityProcessor just might have the same
>>>>> capability).
>>>>>
>>>>> Before I open a jira to request stream support for
>>>>> solrEntityProcessor in DIH, is there an alternate approach for
>>>>> importing large files that are in the solr results format? Maybe a
>>>>> way to use xpath to get the values and a transformer to set the
>>>>> field names? I'm hoping to not have to declare the field names in
>>>>> dataConfig so I can reuse the process across data sets.
>>>>
>>>> How big is the XML file? You might be running into a size limit for
>>>> HTTP POST.
>>>>
>>>> In newer 4.x versions, Solr itself sets the size of the POST buffer
>>>> regardless of what the container config has. That size defaults to
>>>> 2MB but is configurable using the formdataUploadLimitInKB setting
>>>> that you can find in the example solrconfig.xml file, on the
>>>> requestParsers tag.
>>>>
>>>> In Solr 3.x, if you used the included Jetty, it had a configured
>>>> HTTP POST size limit of 1MB. In early Solr 4.x, there was a bug in
>>>> the included Jetty that prevented the configuration element from
>>>> working, so the actual limit was Jetty's default of 200KB. With
>>>> other containers and these older versions, you would need to change
>>>> your container configuration.
>>>>
>>>> https://bugs.eclipse.org/bugs/show_bug.cgi?id=397130
>>>>
>>>> Thanks,
>>>> Shawn
>>>
>>> The SEP calls out to another Solr and reads. Are you importing data
>>> from another Solr and cross-connecting it with your uploaded XML?
>>>
>>> If the memory errors are a problem with streaming, you could try
>>> "piping" your uploaded documents through a processor that supports
>>> streaming. This would then push one document at a time into your
>>> processor that calls out to Solr and combines records.
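For reference, the setting Shawn describes lives on the requestParsers
element in solrconfig.xml. In the stock 4.x example config it looks
roughly like this (2048 KB is the 2MB default he mentions):

    <requestDispatcher handleSelect="false">
      <!-- formdataUploadLimitInKB caps the POST buffer Solr allocates -->
      <requestParsers enableRemoteStreaming="true"
                      multipartUploadLimitInKB="2048000"
                      formdataUploadLimitInKB="2048"/>
    </requestDispatcher>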