Ultimately I just temporarily increased the memory to handle this data set, but that won't always be practical.

I did try the csv export/import and it worked well in this case. I hadn't considered it at first. I am wary that the escaping and splitting may be problematic with some data sets, so I'll look into adding XMLResponseParser support to XPathEntityProcessor (essentially a useSolrResponseSchema option), though I have a feeling only a few other people would be interested in this.
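The rough shape of what I have in mind, as a sketch only (useSolrResponseSchema does not exist today, and the URL is made up):

    <dataConfig>
      <dataSource type="URLDataSource"/>
      <document>
        <!-- useSolrResponseSchema is the proposed, not-yet-existing option;
             it would read the solr response schema directly, so no forEach
             or field declarations would be needed -->
        <entity name="migrated"
                processor="XPathEntityProcessor"
                url="http://files.example.com/select"
                stream="true"
                useSolrResponseSchema="true"/>
      </document>
    </dataConfig>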
Thanks for the replies.

On Mon, Oct 14, 2013 at 11:19 PM, Lance Norskog <goks...@gmail.com> wrote:
> Can you do this data in CSV format? There is a CSV reader in the DIH.
> The SEP was not intended to read from files, since there are already
> better tools that do that.
>
> Lance
>
> On 10/14/2013 04:44 PM, Josh Lincoln wrote:
>> Shawn, I'm able to read in a 4mb file using SEP, so I think that rules
>> out the POST buffer being the issue. Thanks for suggesting I test
>> this. The full file is over a gig.
>>
>> Lance, I'm actually pointing SEP at a static file (I simply named the
>> file "select" and put it on a Web server). SEP thinks it's a large
>> solr response, which it was, though now it's just static xml. It works
>> well until I hit the memory limit of the new solr instance.
>>
>> I can't query the old solr from the new one because they're on two
>> different networks. I can't copy the index files because I only want a
>> subset of the data (identified with a query and dumped to xml... all
>> fields of interest were stored). To further complicate things, the old
>> solr is 1.4. I was hoping to use the result xml format to back up the
>> old index, and DIH SEP to import into the new dev Solr 4.x. It's
>> promising as a simple and repeatable migration process, except that
>> SEP fails on largish files.
>>
>> It seems my options are 1) use the xpathprocessor and identify each
>> field (there are many fields); 2) write a small script to act as a
>> proxy to the xml file, accept the rows and start parameters from the
>> SEP iterative calls, and return just a subset of the docs; 3) write a
>> script to process the xml and push it to solr, not using DIH; 4) use
>> XSLT to transform the result xml into an update message and use
>> XPathEntityProcessor with useSolrAddSchema=true and streaming. The
>> latter seems like the most elegant and reusable approach, though I'm
>> not certain it'll work.
>>
>> It'd be great if solrEntityProcessor could stream static files, or if
>> I could specify the solr result format while using the
>> xpathentityprocessor (i.e. a useSolrResultSchema option).
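>> For reference, a minimal data-config along the lines of what I'm using
>> (host and path made up). SEP pages through {url}/select with rows and
>> start parameters, which is why naming the static file "select" works:
>>
>>     <dataConfig>
>>       <document>
>>         <entity name="sep"
>>                 processor="SolrEntityProcessor"
>>                 url="http://files.example.com/oldsolr"
>>                 query="*:*"
>>                 rows="50"/>
>>       </document>
>>     </dataConfig>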
>> Any other ideas?
>>
>> On Mon, Oct 14, 2013 at 6:24 PM, Lance Norskog <goks...@gmail.com> wrote:
>>> On 10/13/2013 10:02 AM, Shawn Heisey wrote:
>>>> On 10/13/2013 10:16 AM, Josh Lincoln wrote:
>>>>> I have a large solr response in xml format and would like to import
>>>>> it into a new solr collection. I'm able to use DIH with
>>>>> solrEntityProcessor, but only if I first truncate the file to a
>>>>> small subset of the records. I was hoping to set stream="true" to
>>>>> handle the full file, but I still get an out of memory error, so I
>>>>> believe stream does not work with solrEntityProcessor (I know the
>>>>> docs only mention the stream option for the XPathEntityProcessor,
>>>>> but I was hoping solrEntityProcessor just might have the same
>>>>> capability).
>>>>>
>>>>> Before I open a jira to request stream support for
>>>>> solrEntityProcessor in DIH, is there an alternate approach for
>>>>> importing large files that are in the solr results format? Maybe a
>>>>> way to use xpath to get the values and a transformer to set the
>>>>> field names? I'm hoping to not have to declare the field names in
>>>>> dataConfig so I can reuse the process across data sets.
>>>>
>>>> How big is the XML file? You might be running into a size limit for
>>>> HTTP POST.
>>>>
>>>> In newer 4.x versions, Solr itself sets the size of the POST buffer
>>>> regardless of what the container config has. That size defaults to
>>>> 2MB but is configurable using the formdataUploadLimitInKB setting
>>>> that you can find in the example solrconfig.xml file, on the
>>>> requestParsers tag.
>>>>
>>>> In Solr 3.x, if you used the included Jetty, it had a configured
>>>> HTTP POST size limit of 1MB. In early Solr 4.x, there was a bug in
>>>> the included Jetty that prevented the configuration element from
>>>> working, so the actual limit was Jetty's default of 200KB. With
>>>> other containers and these older versions, you would need to change
>>>> your container configuration.
>>>>
>>>> https://bugs.eclipse.org/bugs/show_bug.cgi?id=397130
>>>>
>>>> Thanks,
>>>> Shawn
>>>
>>> The SEP calls out to another Solr and reads. Are you importing data
>>> from another Solr and cross-connecting it with your uploaded XML?
>>>
>>> If the memory errors are a problem with streaming, you could try
>>> "piping" your uploaded documents through a processor that supports
>>> streaming. This would then push one document at a time into your
>>> processor that calls out to Solr and combines records.
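For reference, the setting Shawn describes lives on the requestParsers
element in solrconfig.xml. In the stock 4.x example config it looks
roughly like this (2048 KB is the 2MB default he mentions):

    <requestDispatcher handleSelect="false">
      <!-- formdataUploadLimitInKB caps the POST buffer Solr allocates -->
      <requestParsers enableRemoteStreaming="true"
                      multipartUploadLimitInKB="2048000"
                      formdataUploadLimitInKB="2048"/>
    </requestDispatcher>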