[jira] [Commented] (PIO-38) add Apache Parquet as a data source

2017-09-19 Thread Sara Asher (JIRA)

[ 
https://issues.apache.org/jira/browse/PIO-38?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172396#comment-16172396
 ] 

Sara Asher commented on PIO-38:
---

We can close this when PIO-71 is done.

> add Apache Parquet as a data source
> ---
>
> Key: PIO-38
> URL: https://issues.apache.org/jira/browse/PIO-38
> Project: PredictionIO
> Issue Type: New Feature
> Components: Core
> Reporter: Wojciech Indyk
> Labels: features
>
> Apache Parquet (https://parquet.apache.org/) is a columnar storage format that 
> Apache Spark supports natively, and it is well suited to storing batch data as 
> input for a PredictionIO engine.
> Parquet is widely used for archiving clickstream data, so supporting it would 
> let users run PredictionIO without importing (and duplicating) their data into HBase.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIO-38) add Apache Parquet as a data source

2016-10-01 Thread Wojciech Indyk (JIRA)

[ 
https://issues.apache.org/jira/browse/PIO-38?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15539107#comment-15539107
 ] 

Wojciech Indyk commented on PIO-38:
---

Thanks, [~k4hoo], for the suggestion. I'll try to prepare some code for this.



[jira] [Commented] (PIO-38) add Apache Parquet as a data source

2016-09-26 Thread Wojciech Indyk (JIRA)

[ 
https://issues.apache.org/jira/browse/PIO-38?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523861#comment-15523861
 ] 

Wojciech Indyk commented on PIO-38:
---

Hello [~Ziemin]! Sorry for the late response.
I would like to be able to feed events to PredictionIO from the place where I 
already store them. As far as I can see, PredictionIO works with the pair 
Elasticsearch+HBase, so to use Elasticsearch as a backend I also need to 
use HBase as the event store. I don't know PredictionIO very well, so correct me 
if I'm wrong.
I don't want to use HBase, because it enlarges my technology stack and brings no 
benefit when training a model in batch. Parquet suits this case better: I append 
to my archive of events once a day and can then use this data (or a subset of it) 
to train a recommendation model without duplicating the data in HBase.
Is that clear enough?



[jira] [Commented] (PIO-38) add Apache Parquet as a data source

2016-09-19 Thread Marcin Ziemiński (JIRA)

[ 
https://issues.apache.org/jira/browse/PIO-38?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15504135#comment-15504135
 ] 

Marcin Ziemiński commented on PIO-38:
---

[~woj_in] Could you elaborate on what you would like to see in 
PredictionIO? Would it simply be another storage system for events, using Apache 
Parquet, or do you mean a different kind of workflow in PredictionIO, based on 
stream processing and making use of Parquet?
